What is a regular expression?

A regular expression is a sequence of characters that defines a pattern, which is used to find text that conforms to that pattern.

A regular expression is matched against a string from left to right. Because the term “regular expression” is long, it is often abbreviated to “regex” or “regexp”. Regular expressions can be used to replace text in strings, validate forms, extract substrings from a string based on a pattern, and so on.

1. Basic match

A regular expression is just a pattern of characters that we use to search for letters and numbers in text. For example, the regular expression cat means: the letter c, followed by the letter a, followed by the letter t.

"cat" => The cat sat on the mat

The regular expression 123 matches the string “123”. Matching is done by comparing each character in the regular expression with each character in the input string, one after another. Regular expressions are normally case sensitive, so the regular expression Cat does not match the string “cat”.

"Cat" => The cat sat on the Cat

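As a small sketch, the two matches above can be reproduced with Python's re module. The choice of Python and the variable names are our own assumptions; the text does not name a language or engine.

```python
import re

text = "The cat sat on the Cat"

# Literal patterns are compared character by character, and case matters.
lower_hits = [m.start() for m in re.finditer(r"cat", text)]
upper_hits = [m.start() for m in re.finditer(r"Cat", text)]

print(lower_hits)  # [4]
print(upper_hits)  # [19]
```
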
2. Metacharacters

Metacharacters are the building blocks of regular expressions. A metacharacter does not stand for itself; instead it is interpreted in a special way. Some metacharacters take on a different meaning when written inside square brackets. The metacharacters are as follows:

Metacharacter  Description
.              Matches any single character except a newline.
[ ]            Character class. Matches any character contained between the square brackets.
[^ ]           Negated character class. Matches any character not contained between the square brackets.
*              Matches the preceding subexpression zero or more times.
+              Matches the preceding subexpression one or more times.
?              Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy.
{n,m}          Curly braces. Matches the preceding subexpression at least n times but no more than m times.
(xyz)          Character group. Matches the characters xyz in that exact order.
|              Alternation (branch structure). Matches either the expression before or the expression after the symbol.
\              Escape character. Restores the literal meaning of the next character, which lets you match the reserved characters [ ] ( ) { } . * + ? ^ $ \ |
^              Matches the start of a line.
$              Matches the end of a line.

2.1 The period

The period . is the simplest example of a metacharacter. The metacharacter . matches any single character. It does not match newline characters. For example, the regular expression .ar means: any character, followed by the letter a, followed by the letter r.

".ar" => The car parked in the garage.

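A minimal sketch of the same example, assuming Python's re module (our choice of engine, not the text's):

```python
import re

text = "The car parked in the garage."

# "." stands for any one character, so ".ar" finds three-character matches.
dot_hits = re.findall(r".ar", text)
print(dot_hits)  # ['car', 'par', 'gar']

# By default "." does not match a newline, so "\nar" yields nothing.
newline_hits = re.findall(r".ar", "\nar")
print(newline_hits)  # []
```
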
2.2 Character sets

Character sets are also called character classes. Square brackets are used to specify a character set. Use a hyphen inside a character set to specify a range of characters. The order of the characters or ranges inside the square brackets does not matter. For example, the regular expression [Tt]he means: an uppercase T or lowercase t, followed by the letter h, followed by the letter e.

"[Tt]he" => The car parked in the garage.

A period inside a character set, however, means a literal period. The regular expression ar[.] means: a lowercase letter a, followed by the letter r, followed by a period character.

"ar[.]" => A garage is a good place to park a car.

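The difference between a period inside and outside a set can be checked with a short sketch. Python's re module is assumed here, and the unbracketed pattern ar. is added only for contrast.

```python
import re

# A set matches exactly one character out of those listed.
the_words = re.findall(r"[Tt]he", "The car parked in the garage.")
print(the_words)  # ['The', 'the']

sentence = "A garage is a good place to park a car."

# Inside a set the period is literal; outside it is a wildcard.
literal_dot = re.findall(r"ar[.]", sentence)
wildcard_dot = re.findall(r"ar.", sentence)
print(literal_dot)   # ['ar.']
print(wildcard_dot)  # ['ara', 'ark', 'ar.']
```
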
2.2.1 Negated character sets

In general, the caret symbol ^ marks the start of the string, but when it is typed right after the opening square bracket it negates the character set. For example, the regular expression [^c]ar means: any character except c, followed by the character a, followed by the letter r.

"[^c]ar" => The car parked in the garage.

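A quick sketch of the two roles of the caret, again assuming Python's re module; the pattern ^The is our own addition for contrast.

```python
import re

text = "The car parked in the garage."

# Inside brackets, ^ negates the set: "car" is skipped.
not_c = re.findall(r"[^c]ar", text)
print(not_c)  # ['par', 'gar']

# Outside brackets, ^ anchors the match to the start of the string.
at_start = re.findall(r"^The", text)
print(at_start)  # ['The']
```
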
2.3 Repetitions

The metacharacters +, * and ? are used to specify how many times a subpattern can occur. These metacharacters act differently in different situations.

2.3.1 The asterisk

The symbol * matches zero or more repetitions of whatever precedes it. The regular expression a* means: zero or more repetitions of the lowercase letter a. If it appears after a character set or class, it repeats the entire character set. For example, the regular expression [a-z]* means: any number of lowercase letters in a row.

"[a-z]*" => The car parked in the garage #21.

The * symbol can be used with the metacharacter . to match any string of characters: .*. The * symbol can also be used with the whitespace character \s to match a run of whitespace characters. For example, the regular expression \s*cat\s* means: zero or more spaces, followed by a lowercase c, followed by a lowercase a, followed by a lowercase t, followed by zero or more spaces.

"\s*cat\s*" => The fat cat sat on the cat.

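The "zero or more" behaviour is easy to see in a short sketch. Python's re module is an assumption, and the pattern ba* with its input string is a hypothetical example of ours.

```python
import re

# a* allows zero repetitions, so "b" alone still matches "ba*".
runs = re.findall(r"ba*", "b ba baa")
print(runs)  # ['b', 'ba', 'baa']

# \s* absorbs any whitespace around the word, including none at all.
padded = re.findall(r"\s*cat\s*", "The fat cat sat on the cat.")
print(padded)  # [' cat ', ' cat']
```
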
2.3.2 The plus

The symbol + matches one or more repetitions of the preceding character. For example, the regular expression c.+t means: a lowercase letter c, followed by at least one character, followed by a lowercase letter t. Because + is greedy, the t in question is the last t in the sentence.

"c.+t" => The fat cat sat on the mat.

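A minimal sketch of this greedy behaviour, assuming Python's re module; the two-character input "ct" is our own addition.

```python
import re

text = "The fat cat sat on the mat."

# + is greedy: .+ runs as far as it can while still leaving a final t.
greedy = re.findall(r"c.+t", text)
print(greedy)  # ['cat sat on the mat']

# At least one character is required between c and t.
too_short = re.search(r"c.+t", "ct")
print(too_short)  # None
```
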
2.3.3 The question mark

In regular expressions, the metacharacter ? makes the preceding character optional. It matches zero or one occurrence of the preceding character. For example, the regular expression [T]?he means: an optional uppercase letter T, followed by a lowercase letter h, followed by a lowercase letter e.

"[T]he" => The car is parked in the garage.
"[T]?he" => The car is parked in the garage.

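The two patterns can be compared side by side in a short sketch (Python's re module assumed):

```python
import re

text = "The car is parked in the garage."

required = re.findall(r"[T]he", text)
optional = re.findall(r"[T]?he", text)

print(required)  # ['The']
# With the optional T, the "he" inside "the" also matches.
print(optional)  # ['The', 'he']
```
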
2.4 Curly braces

In regular expressions, curly braces (also known as quantifiers) are used to specify how many times a character or group of characters can be repeated. For example, the regular expression [0-9]{2,3} means: match at least two but no more than three digits (characters in the range 0 to 9).

"[0-9]{2,3}" => The number was 9.9997 but we rounded it off to 10.0.

We can omit the second number. For example, the regular expression [0-9]{2,} means: match two or more digits. If we also remove the comma, the regular expression [0-9]{2} means: match exactly two digits.

"[0-9]{2,}" => The number was 9.9997 but we rounded it off to 10.0.
"[0-9]{2}" => The number was 9.9997 but we rounded it off to 10.0.

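A sketch of the three quantifier forms applied to the same sentence, assuming Python's re module:

```python
import re

text = "The number was 9.9997 but we rounded it off to 10.0."

two_to_three = re.findall(r"[0-9]{2,3}", text)
two_or_more = re.findall(r"[0-9]{2,}", text)
exactly_two = re.findall(r"[0-9]{2}", text)

print(two_to_three)  # ['999', '10']
print(two_or_more)   # ['9997', '10']
print(exactly_two)   # ['99', '97', '10']
```

Single digits such as the leading 9 or the trailing 0 are never matched, because every form requires at least two digits in a row.
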
2.5 Character groups

A character group is a set of subpatterns written inside parentheses (…). As discussed earlier, if we put a quantifier after a character, it repeats the preceding character. If we put a quantifier after a character group, however, it repeats the whole group. For example, the regular expression (ab)* matches zero or more repetitions of the string “ab”. We can also use the alternation metacharacter | inside a character group. For example, the regular expression (c|g|p)ar means: a lowercase c, g or p, followed by the letter a, followed by the letter r.

"(c|g|p)ar" => The car is parked in the garage.

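A small sketch of groups in Python's re module (an assumed engine). It uses finditer rather than findall, because findall returns only the group contents when a pattern contains a group.

```python
import re

text = "The car is parked in the garage."
matches = list(re.finditer(r"(c|g|p)ar", text))

# group(0) is the whole match, group(1) is what the parentheses captured.
whole = [m.group(0) for m in matches]
first_letters = [m.group(1) for m in matches]
print(whole)          # ['car', 'par', 'gar']
print(first_letters)  # ['c', 'p', 'g']

# A quantifier after a group repeats the whole group.
repeated = re.fullmatch(r"(ab)*", "ababab") is not None
print(repeated)  # True
```
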
2.6 Alternation (branch structure)

In a regular expression, the vertical bar | is used to define alternation, also called a branch structure. Alternation works like a condition between multiple expressions. You might think that character sets and alternation work the same way, but the big difference is that a character set works at the character level, whereas alternation works at the expression level. For example, the regular expression (T|t)he|car means: either an uppercase T or a lowercase t, followed by a lowercase h, followed by a lowercase e; or a lowercase c, followed by a lowercase a, followed by a lowercase r.

"(T|t)he|car" => The car is parked in the garage.

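The example can be sketched as follows, assuming Python's re module. The second pattern, which swaps the inner group for a character set, is our own variation.

```python
import re

text = "The car is parked in the garage."

# The top-level | chooses between two whole expressions: (T|t)he or car.
found = [m.group(0) for m in re.finditer(r"(T|t)he|car", text)]
print(found)  # ['The', 'car', 'the']

# The inner (T|t) chooses between single characters, so a set does the same job.
with_set = re.findall(r"[Tt]he|car", text)
print(with_set)  # ['The', 'car', 'the']
```
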
2.7 Escaping Special Characters

A backslash \ is used in a regular expression to escape the next character. This lets you match the reserved characters { } [ ] / \ + * . $ ^ | ? as ordinary characters. To use a special character as a matching character, put a \ in front of it. For example, the regular expression . matches any character except a newline. To match a literal . in the input string, write \. instead: the regular expression (f|c|m)at\.? means a lowercase f, c or m, followed by a lowercase a, followed by a lowercase t, followed by an optional . character.

"(f|c|m)at\.?" => The fat cat sat on the mat.

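A minimal sketch, assuming Python's re module. The call to re.escape and the price string are our own additions; re.escape is one way to escape literal text without typing the backslashes by hand.

```python
import re

text = "The fat cat sat on the mat."

# \. is a literal period; the trailing ? makes it optional.
found = [m.group(0) for m in re.finditer(r"(f|c|m)at\.?", text)]
print(found)  # ['fat', 'cat', 'mat.']

# re.escape adds the backslashes needed to match a string literally.
price_pattern = re.escape("$4.44")
literal_ok = re.fullmatch(price_pattern, "$4.44") is not None
wildcard_rejected = re.fullmatch(price_pattern, "$4x44") is None
print(literal_ok, wildcard_rejected)  # True True
```
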
2.8 Anchors

In regular expressions, we use anchors (sometimes called locators) to check whether the matching symbol is the first or the last symbol of the input string. There are two types of anchor: the first is the caret ^, which checks whether the matching character is the first character of the input, and the second is the dollar sign $, which checks whether the matching character is the last character of the input string.

2.8.1 Caret

The caret symbol ^ checks whether the matching character is the first character of the input string. If we apply the regular expression ^a (a must be the first character) to the input string abc, it matches a. But if we apply the regular expression ^b to the same string, it matches nothing, because b is not the first character of abc. Let’s look at another regular expression, ^(T|t)he, which means: an uppercase T or a lowercase t at the start of the input string, followed by a lowercase h, followed by a lowercase e.

"(T|t)he" => The car is parked in the garage.
"^(T|t)he" => The car is parked in the garage.

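Both cases from this subsection can be checked with a short sketch (Python's re module assumed):

```python
import re

text = "The car is parked in the garage."

anywhere = [m.group(0) for m in re.finditer(r"(T|t)he", text)]
at_start = [m.group(0) for m in re.finditer(r"^(T|t)he", text)]
print(anywhere)  # ['The', 'the']
print(at_start)  # ['The']

# ^a matches at the start of "abc"; ^b does not match at all.
a_match = re.search(r"^a", "abc")
b_match = re.search(r"^b", "abc")
print(a_match.group(0), b_match)  # a None
```
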
2.8.2 Dollar sign

The dollar sign $ checks whether the matching character is the last character of the input string. For example, the regular expression (at\.)$ means: a lowercase a, followed by a lowercase t, followed by a . character, and the match must be at the end of the string.

"(at\.)" => The fat cat. sat. on the mat.
"(at\.)$" => The fat cat sat on the mat.

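A sketch of the end anchor, assuming Python's re module. It records where each match starts so the effect of $ is visible.

```python
import re

text = "The fat cat. sat. on the mat."

# Without $, every "at." in the sentence matches.
anywhere = [m.start() for m in re.finditer(r"(at\.)", text)]
# With $, only the "at." that ends the string matches.
at_end = [m.start() for m in re.finditer(r"(at\.)$", text)]

print(anywhere)  # [9, 14, 26]
print(at_end)    # [26]
```
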
3. Shorthand character sets

Regular expressions provide shorthands for commonly used character sets, which are convenient abbreviations for common regular expressions. The shorthand character sets are as follows:

Shorthand  Description
.          Matches any character except a newline
\w         Matches alphanumeric characters and the underscore: [a-zA-Z0-9_]
\W         Matches non-alphanumeric characters: [^\w]
\d         Matches digits: [0-9]
\D         Matches non-digits: [^\d]
\s         Matches whitespace characters: [\t\n\f\r\p{Z}]
\S         Matches non-whitespace characters: [^\s]
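
A few of these shorthands in action, as a sketch that assumes Python's re module and a made-up sample string. Python's re does not accept the \p{Z} notation shown in the table; its built-in \s is used instead.

```python
import re

sample = "Room 42, floor 7: snake_case ok"

digits = re.findall(r"\d+", sample)   # runs of digits
words = re.findall(r"\w+", sample)    # runs of letters, digits and underscores
chunks = re.findall(r"\S+", sample)   # runs of non-whitespace characters

print(digits)  # ['42', '7']
print(words)   # ['Room', '42', 'floor', '7', 'snake_case', 'ok']
print(chunks)  # ['Room', '42,', 'floor', '7:', 'snake_case', 'ok']
```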

4. Lookaround assertions

Lookbehind and lookahead assertions, together called lookarounds, are special types of non-capturing groups (they are used to match a pattern, but are not included in the match list). Lookarounds are used when a pattern must be preceded or followed by another pattern. For example, suppose we want to get all the numbers that come after the $ character in the input string $4.44 and $10.88. We can use the regular expression (?<=\$)[0-9\.]*, which means: get all the numbers, possibly containing the . character, that are preceded by the $ character. Here are the lookarounds used in regular expressions:

Symbol  Description
?=      Positive lookahead
?!      Negative lookahead
?<=     Positive lookbehind
?<!     Negative lookbehind

4.1 Positive lookahead

A positive lookahead asserts that the first part of the expression must be followed by the lookahead expression. The returned match contains only the text matched by the first part of the expression. A positive lookahead is written in parentheses with a question mark and an equals sign: (?=…). The lookahead expression is written after the equals sign inside the parentheses. For example, the regular expression (T|t)he(?=\sfat) means: match either an uppercase T or a lowercase t, followed by the letter h, followed by the letter e. In the parentheses we define a positive lookahead, which tells the regular expression engine to match The or the only when it is followed by a whitespace character and the word fat.

"(T|t)he(?=\sfat)" => The fat cat sat on the mat.

4.2 Negative lookahead

A negative lookahead is used when we need to get all matches from the input string that are not followed by a certain pattern. A negative lookahead is defined in the same way as a positive lookahead; the only difference is that instead of the equals sign = we use the negation sign !, as in (?!…). Let’s take a look at the regular expression (T|t)he(?!\sfat), which means: get every The or the in the input string that is not followed by a whitespace character and the word fat.

"(T|t)he(?!\sfat)" => The fat cat sat on the mat.

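The two lookaheads can be compared in a short sketch, assuming Python's re module. Each result pairs the matched text with its start position, to show that the lookahead text itself is not part of the match.

```python
import re

text = "The fat cat sat on the mat."

followed = [(m.group(0), m.start())
            for m in re.finditer(r"(T|t)he(?=\sfat)", text)]
not_followed = [(m.group(0), m.start())
                for m in re.finditer(r"(T|t)he(?!\sfat)", text)]

print(followed)      # [('The', 0)]
print(not_followed)  # [('the', 19)]
```
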
4.3 Positive lookbehind

A positive lookbehind is used to get all matches that are preceded by a specific pattern. A positive lookbehind is written (?<=…). For example, the regular expression (?<=(T|t)he\s)(fat|mat) means: get every word fat or mat in the input string that comes after the word The or the.

"(?<=(T|t)he\s)(fat|mat)" => The fat cat sat on the mat.

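4.4 Negative lookbehind

A negative lookbehind is used to get all matches that are not preceded by a specific pattern. A negative lookbehind is written (?<!…). For example, the regular expression (?<!(T|t)he\s)(cat) means: get every word cat in the input string that does not come after the word The or the.

"(?<!(T|t)he\s)(cat)" => The cat sat on cat.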

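A sketch of both lookbehinds plus the dollar-amount example from the start of this section, assuming Python's re module. Python requires lookbehind patterns to have a fixed width, which all three patterns here do.

```python
import re

after_the = [m.group(0) for m in
             re.finditer(r"(?<=(T|t)he\s)(fat|mat)", "The fat cat sat on the mat.")]
print(after_the)  # ['fat', 'mat']

# Only the second "cat" qualifies: the first one follows "The ".
not_after_the = [(m.group(0), m.start()) for m in
                 re.finditer(r"(?<!(T|t)he\s)(cat)", "The cat sat on cat.")]
print(not_after_the)  # [('cat', 15)]

# The $ sign is required before the number but is not part of the match.
amounts = re.findall(r"(?<=\$)[0-9\.]*", "$4.44 and $10.88")
print(amounts)  # ['4.44', '10.88']
```
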
5. Flags

Flags are also called modifiers because they modify the output of a regular expression. Flags can be used in any order or combination, and they are part of the regular expression.

Flag  Description
i     Case insensitive: makes matching ignore case.
g     Global search: searches for all matches in the entire input string.
m     Multi-line: the anchors ^ and $ match at the start and end of each line of the input string.

5.1 Case insensitive

The i modifier is used to perform case-insensitive matching. For example, the regular expression /The/gi means: an uppercase letter T, followed by a lowercase letter h, followed by the letter e. The i flag at the end of the regular expression tells the regular expression engine to ignore case. As you can see, we also add the g flag because we want to search for matches throughout the whole input string.

"The" => The fat cat sat on the mat.
"/The/gi" => The fat cat sat on the mat.

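In Python's re module (our assumed engine) there is no /.../gi literal syntax. As a rough equivalent, the i flag becomes re.IGNORECASE and re.findall already returns every match:

```python
import re

text = "The fat cat sat on the mat."

exact = re.findall(r"The", text)
any_case = re.findall(r"The", text, flags=re.IGNORECASE)

print(exact)     # ['The']
print(any_case)  # ['The', 'the']
```
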
5.2 Global Search

The g modifier is used to perform a global match (find all matches rather than stopping after the first match). For example, the regular expression /.(at)/g means: any character except a newline, followed by a lowercase a, followed by a lowercase t. Because we use the g flag at the end of the regular expression, it finds every match in the entire input string.

".(at)" => The fat cat sat on the mat.
"/.(at)/g" => The fat cat sat on the mat.

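Python's re module has no g flag, so this sketch approximates the difference by comparing re.search, which stops at the first match, with re.finditer, which walks the whole string. The mapping is our assumption, not something the text prescribes.

```python
import re

text = "The fat cat sat on the mat."

# Like a pattern without g: only the first match is reported.
first_only = re.search(r".(at)", text).group(0)
# Like a pattern with g: every match is reported.
all_hits = [m.group(0) for m in re.finditer(r".(at)", text)]

print(first_only)  # fat
print(all_hits)    # ['fat', 'cat', 'sat', 'mat']
```
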
5.3 Multi-line Matching

The m modifier is used to perform a multi-line match. As discussed earlier, the anchors (^, $) check whether a pattern is at the start or at the end of the input string. If we want the anchors to work on each line instead, we use the m flag. For example, the regular expression /.at(.)?$/gm means: any character except a newline, followed by a lowercase a, followed by a lowercase t, optionally followed by any character except a newline. Because of the m flag, the regular expression engine now matches the pattern at the end of every line in the string.

"/.at(.)?$/" => The fat cat sat on the mat.
"/.at(.)?$/gm" => The fat
                  cat sat
                  on the mat.

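A sketch of the same pattern with and without multi-line mode, assuming Python's re module, where the m flag is spelled re.MULTILINE. The three-line input mirrors the second example above.

```python
import re

text = "The fat\ncat sat\non the mat."
pattern = r".at(.)?$"

# Without MULTILINE, $ only matches at the very end of the string.
end_of_string = [m.group(0) for m in re.finditer(pattern, text)]
# With MULTILINE, $ also matches just before each newline.
end_of_lines = [m.group(0)
                for m in re.finditer(pattern, text, flags=re.MULTILINE)]

print(end_of_string)  # ['mat.']
print(end_of_lines)   # ['fat', 'sat', 'mat.']
```
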
Common regular expressions

  • Positive integer: ^\d+$
  • Negative integer: ^-\d+$
  • Phone number: ^\+?[\d\s]{3,}$
  • Phone number with code: ^\+?[\d\s]+\(?[\d\s]{10,}$
  • Integer: ^-?\d+$
  • User name: ^[\w\d_]{4,16}$
  • Alphanumeric characters: ^[a-zA-Z0-9]*$
  • Alphanumeric characters with spaces: ^[a-zA-Z0-9 ]*$
  • Password: ^(?=^.{6,}$)((?=.*[A-Za-z0-9])(?=.*[A-Z])(?=.*[a-z]))^.*$
  • E-mail: ^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})*$
  • IPv4 address: ^((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))*$
  • Lowercase letters: ^([a-z])*$
  • Capital letters: ^([A-Z])*$
  • URL: ^(((http|https|ftp):\/\/)?([[a-zA-Z0-9]\-\.])+(\.)([[a-zA-Z0-9]]){2,4}([[a-zA-Z0-9]\/+=%&_\.~?\-]*))*$
  • VISA credit card number: ^(4[0-9]{12}(?:[0-9]{3})?)*$
  • Date (MM/DD/YYYY): ^(0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])[- /.](19|20)?[0-9]{2}$
  • Date (YYYY/MM/DD): ^(19|20)?[0-9]{2}[- /.](0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])$
  • MasterCard card number: ^(5[1-5][0-9]{14})*$
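
A few of the patterns listed above can be tried out with a small helper. This is only a sketch that assumes Python's re module. The helper name is_match and the constant names are hypothetical, and the date pattern is written out as we read it from the list.

```python
import re

POSITIVE_INTEGER = r"^\d+$"
INTEGER = r"^-?\d+$"
DATE_MDY = (r"^(0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])"
            r"[- /.](19|20)?[0-9]{2}$")


def is_match(pattern, value):
    """Return True if the anchored pattern matches the value."""
    return re.search(pattern, value) is not None


print(is_match(POSITIVE_INTEGER, "42"))   # True
print(is_match(POSITIVE_INTEGER, "-42"))  # False
print(is_match(INTEGER, "-42"))           # True
print(is_match(DATE_MDY, "12/31/2024"))   # True
print(is_match(DATE_MDY, "13/01/2024"))   # False
```

These patterns check only the shape of the input. For example, the date pattern accepts 02/31/2024, so values that must be real dates still need a separate check.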

extension