Common regular expressions

Regular expressions are used for string processing and form validation. Some commonly used expressions are collected here for future reference.

User name: /^[a-z0-9_-]{3,16}$/

Password: /^[a-z0-9_-]{6,16}$/

Hex value: /^#?([a-f0-9]{6}|[a-f0-9]{3})$/
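As a quick check of the hex value pattern, here is a minimal Python sketch. The helper name `is_hex_color` is my own, and `fullmatch` is used in place of the `^` and `$` anchors; treat it as an illustration rather than the only way to apply the pattern.

```python
import re

# The hex value pattern from the list; fullmatch() plays the role of ^ and $.
HEX_COLOR = re.compile(r"#?([a-f0-9]{6}|[a-f0-9]{3})")


def is_hex_color(value: str) -> bool:
    """Return True if value is a 3- or 6-digit lowercase hex color, with or without '#'."""
    return HEX_COLOR.fullmatch(value) is not None


for sample in ["#a3c113", "fff", "#4d82h4", "#ffff"]:
    print(sample, is_hex_color(sample))
```

Note that the pattern accepts lowercase digits only; pass the `re.IGNORECASE` flag if uppercase values should also be accepted.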

E-mail: /^([a-z0-9_.-]+)@([\da-z.-]+)\.([a-z.]{2,6})$/

URL: /^(https?:\/\/)?([\da-z.-]+)\.([a-z.]{2,6})([\/\w .-]*)\/?$/

IP address: /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/
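The IP address pattern is easier to read when the repeated octet part is named. The following Python sketch builds it from one piece; `OCTET`, `IPV4` and `is_ipv4` are names chosen here for illustration.

```python
import re

# One octet: 250-255, 200-249, or 0-199 (leading zeros are tolerated by this pattern).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Three "octet." groups followed by a final octet.
IPV4 = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")


def is_ipv4(text: str) -> bool:
    """Return True if the whole string is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None


for sample in ["192.168.1.1", "0.0.0.0", "256.1.1.1", "1.2.3"]:
    print(sample, is_ipv4(sample))
```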

HTML tag: /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/

The range of Chinese characters in Unicode encoding: /^[\u4E00-\u9FA5]{0,}$/

Match Chinese characters: [\u4e00-\u9fa5]

Note: Matching Chinese text can be a real headache; with this expression it becomes easy.

Match double-byte characters (including Chinese characters): [^\x00-\xff]

Note: Can be used to calculate the length of a string (counting 2 for a double-byte character and 1 for an ASCII character).
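A minimal Python sketch of that length calculation follows. The function name `display_length` is my own, and the sketch assumes that every character outside \x00-\xff should count as 2.

```python
import re

# Any character outside the single-byte range \x00-\xff.
DOUBLE_BYTE = re.compile(r"[^\x00-\xff]")


def display_length(text: str) -> int:
    """Count 1 per character, plus 1 extra for each double-byte character."""
    return len(text) + len(DOUBLE_BYTE.findall(text))


print(display_length("abc"))    # three ASCII characters
print(display_length("中文"))   # two Chinese characters
print(display_length("a中"))    # one of each
```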

Match blank lines: \n\s*\r

Note: Can be used to delete blank lines.

Match HTML tags: <(\S*?)[^>]*>.*?</\1>|<.*? />

Note: The versions circulating on the Internet are poor; this one matches only some cases and still cannot handle complex nested tags.

Match leading and trailing whitespace: ^\s*|\s*$

Note: A useful expression for removing whitespace (including spaces, tabs, form feeds, and so on) at the beginning and end of a line.

Match an email address: \w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*

Note: Useful for form validation.

Match a URL: [a-zA-Z]+://[^\s]*

Note: The versions circulating on the Internet are very limited in function; the one above basically meets the need.

Check whether an account name is valid (starts with a letter, 5-16 bytes, letters, digits and underscores allowed): ^[a-zA-Z][a-zA-Z0-9_]{4,15}$

Note: Useful for form validation.

Match a domestic phone number: \d{3}-\d{8}|\d{4}-\d{7}

Note: The matching forms are like 0511-4405222 or 021-87888822.
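To see the phone number pattern at work on the two sample forms, here is a small Python sketch. The function name `find_phone_numbers` and the sample sentence are made up for illustration.

```python
import re

# 3-digit area code with 8-digit number, or 4-digit area code with 7-digit number.
PHONE = re.compile(r"\d{3}-\d{8}|\d{4}-\d{7}")


def find_phone_numbers(text: str) -> list[str]:
    """Return every substring that looks like a domestic phone number."""
    return PHONE.findall(text)


print(find_phone_numbers("Call 0511-4405222 or 021-87888822."))
```

Because the pattern is not anchored, it also finds numbers embedded in longer text; wrap it in `^(?:...)$` when validating a whole form field.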

Match a Tencent QQ number: [1-9][0-9]{4,}

Note: Tencent QQ numbers start from 10000.

Match a mainland China postal code: [1-9]\d{5}(?!\d)

Note: Postal codes in mainland China are 6 digits.

Match an ID card number: (^\d{15}$)|(^\d{18}$)

Note: ID card numbers in mainland China have 15 or 18 digits.

Match an IP address: \d+\.\d+\.\d+\.\d+

Note: Useful for extracting IP addresses.

Match specific numbers:

^[1-9]\d*$ // Matches positive integers

^-[1-9]\d*$ // Matches negative integers

^-?[1-9]\d*$ // Matches integers (excluding 0)

^([1-9]\d*|0)$ // Matches non-negative integers (positive integers + 0)

^(-[1-9]\d*|0)$ // Matches non-positive integers (negative integers + 0)

^([1-9]\d*\.\d*|0\.\d*[1-9]\d*)$ // Matches positive floating-point numbers

^-([1-9]\d*\.\d*|0\.\d*[1-9]\d*)$ // Matches negative floating-point numbers

^-?([1-9]\d*\.\d*|0\.\d*[1-9]\d*|0?\.0+|0)$ // Matches floating-point numbers

^([1-9]\d*\.\d*|0\.\d*[1-9]\d*|0?\.0+|0)$ // Matches non-negative floating-point numbers (positive floating-point numbers + 0)

^(-([1-9]\d*\.\d*|0\.\d*[1-9]\d*)|0?\.0+|0)$ // Matches non-positive floating-point numbers (negative floating-point numbers + 0)

Note: Useful when processing large amounts of data; adjust as needed for the specific application.
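One detail worth checking in number patterns of this kind is alternation precedence: | binds more loosely than ^ and $, so a pattern such as ^[1-9]\d*|0$ is read as "starts with a positive integer" or "ends with 0". The Python sketch below compares that form with a grouped one; the names `UNGROUPED`, `GROUPED` and `is_non_negative_integer` are mine.

```python
import re

# Parsed as (^[1-9]\d*) | (0$): each anchor applies to only one branch.
UNGROUPED = re.compile(r"^[1-9]\d*|0$")

# Grouping the alternation makes both anchors apply to both branches.
GROUPED = re.compile(r"^([1-9]\d*|0)$")


def is_non_negative_integer(text: str) -> bool:
    """Return True for '0' or a positive integer without leading zeros."""
    return GROUPED.search(text) is not None


for sample in ["42", "0", "12abc", "abc0"]:
    print(sample, bool(UNGROUPED.search(sample)), is_non_negative_integer(sample))
```

The ungrouped form accepts "12abc" and "abc0", while the grouped form rejects both.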

Match specific strings:

^[A-Za-z]+$ // Matches a string made up of the 26 English letters

^[A-Z]+$ // Matches a string made up of the 26 uppercase letters

^[a-z]+$ // Matches a string made up of the 26 lowercase letters

^[A-Za-z0-9]+$ // Matches a string made up of digits and the 26 English letters

^\w+$ // Matches a string made up of digits, the 26 English letters, or underscores

Expression set

Regular expressions come in many different flavors. The following table is a complete list of metacharacters in PCRE and their behavior in the context of regular expressions:

Character  Description
\  Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, "n" matches the character "n", while "\n" matches a newline character. The sequence "\\" matches "\", and "\(" matches "(".
^  Matches the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after "\n" or "\r".
$  Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before "\n" or "\r".
*  Matches the preceding subexpression zero or more times. For example, "zo*" matches "z" and "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, "zo+" matches "zo" and "zoo", but does not match "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but matches the two o's in "food".
{n,}  n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*".
{n,m}  m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". "o{0,1}" is equivalent to "o?". Note that there must be no space between the comma and the numbers.
?  When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible. For example, for the string "oooo", "o+?" matches a single "o", while "o+" matches all the o's.
.  Matches any single character except "\n". To match any character including "\n", use a pattern such as "[\s\S]".
(pattern)  Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript. To match parenthesis characters, use "\(" or "\)".
(?:pattern)  Matches pattern but does not capture the match; that is, it is a non-capturing match and is not stored for later use. This is useful for combining parts of a pattern with the "or" character "|". For example, "industr(?:y|ies)" is a shorter expression than "industry|industries".
(?=pattern)  Positive lookahead. Matches the search string at any position where a string matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, "Windows(?=95|98|NT|2000)" matches the "Windows" in "Windows2000", but does not match the "Windows" in "Windows3.1". Lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)  Negative lookahead. Matches the search string at any position where a string not matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, "Windows(?!95|98|NT|2000)" matches the "Windows" in "Windows3.1", but does not match the "Windows" in "Windows2000". Lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y  Matches x or y. For example, "z|food" matches "z" or "food". "(z|f)ood" matches "zood" or "food".
[xyz]  Character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".
[^xyz]  Negated character set. Matches any character that is not enclosed. For example, "[^abc]" matches the "p" in "plain".
[a-z]  Character range. Matches any character in the specified range. For example, "[a-z]" matches any lowercase letter in the range "a" to "z".
[^a-z]  Negated character range. Matches any character that is not in the specified range. For example, "[^a-z]" matches any character not in the range "a" to "z".
\b  Matches a word boundary, that is, the position between a word and a space. For example, "er\b" matches the "er" in "never", but does not match the "er" in "verb".
\B  Matches a non-word boundary. "er\B" matches the "er" in "verb", but does not match the "er" in "never".
\cx  Matches the control character specified by x. For example, \cM matches Control-M, or a carriage return character. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal "c" character.
\d  Matches a digit character. Equivalent to [0-9].
\D  Matches a non-digit character. Equivalent to [^0-9].
\f  Matches a form feed character. Equivalent to \x0c and \cL.
\n  Matches a newline character. Equivalent to \x0a and \cJ.
\r  Matches a carriage return. Equivalent to \x0d and \cM.
\s  Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S  Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t  Matches a tab character. Equivalent to \x09 and \cI.
\v  Matches a vertical tab character. Equivalent to \x0b and \cK.
\w  Matches any word character, including the underscore. Equivalent to "[A-Za-z0-9_]".
\W  Matches any non-word character. Equivalent to "[^A-Za-z0-9_]".
\xn  Matches n, where n is a hexadecimal escape value. The hexadecimal escape value must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". This allows ASCII codes to be used in regular expressions.
\num  Matches num, where num is a positive integer. A backreference to a captured match. For example, "(.)\1" matches two consecutive identical characters.
\n  Identifies an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm  Identifies an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither of the preceding conditions holds, and n and m are both octal digits (0-7), \nm matches the octal escape value nm.
\nml  If n is an octal digit (0-3), and m and l are both octal digits (0-7), matches the octal escape value nml.
\un  Matches n, where n is a Unicode character represented by four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
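A few of the table's examples can be checked directly. The sketch below uses Python's re module, which is not the engine the table describes, but these particular constructs (greedy and non-greedy quantifiers, lookahead, backreferences, word boundaries) behave the same way there. The function name `table_examples` is my own.

```python
import re


def table_examples() -> dict:
    """Run several examples from the metacharacter table and collect the results."""
    return {
        # Non-greedy versus greedy quantifier on "oooo".
        "lazy": re.match(r"o+?", "oooo").group(),
        "greedy": re.match(r"o+", "oooo").group(),
        # Positive lookahead: "Windows" only when followed by one of the versions.
        "lookahead_hit": re.search(r"Windows(?=95|98|NT|2000)", "Windows2000").group(),
        "lookahead_miss": re.search(r"Windows(?=95|98|NT|2000)", "Windows3.1"),
        # Negative lookahead: "Windows" only when NOT followed by them.
        "negative_lookahead": re.search(r"Windows(?!95|98|NT|2000)", "Windows3.1").group(),
        # Backreference: two consecutive identical characters.
        "backreference": re.search(r"(.)\1", "hello").group(),
        # Word boundary: the "er" in "verb" is not at the end of the word.
        "boundary_miss": re.search(r"er\b", "verb"),
    }


for name, result in table_examples().items():
    print(name, repr(result))
```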

Regular expression efficiency

If the goal is purely to challenge your own regex skills or to achieve some special effect (such as using regular expressions to test for prime numbers or solve linear equations), efficiency is not a problem. If a regular expression is written to run only once or twice, or a few dozen times, it makes little difference whether it is optimized. However, if you are writing regular expressions that will run millions or tens of millions of times, efficiency is a big problem. Here are a few things I have learned (from work, from books, and from my own experience) about making regular expressions run more efficiently. If you have other experience not covered here, you are welcome to share it.

For convenience, first define two concepts.

False matches: the regular expression matches more than it should, so some text that does not meet the requirements is "hit" by it. For example, if you use \d{11} to match an 11-digit mobile phone number, \d{11} matches not only correct phone numbers but also strings like 98765432100 that are clearly not phone numbers. We call such matches false matches.

Missing matches: the range the regular expression matches is too narrow, so some text that is really needed is not covered by it. For example, using \d{18} to match an 18-digit ID card number misses numbers that end in the letter X.

A regular expression you write may produce only false matches (the conditions are so loose that they cover more than the target text), only missing matches (it describes just one of several cases in the target text), or both. For example, if you use \w+\.com to match domain names ending in .com, it will falsely match a string like abc_.com (legitimate domain names do not contain underscores, but \w matches them) and will miss a domain name like ab-c.com (legitimate domain names can contain hyphens, but \w does not match them).

An accurate regular expression produces neither false matches nor missing matches. Of course, there are situations where you can see only a limited amount of text and write rules based on it, yet those rules will be applied to a huge amount of text. In this case, the goal is to eliminate false matches and missing matches as far as possible, even if not completely, and to improve running efficiency. The experience offered in this article is aimed mainly at this kind of situation.
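The domain example can be reproduced in a few lines of Python. `LOOSE` is the \w-based pattern from the text; `STRICTER` is a hypothetical refinement of my own (letters, digits and inner hyphens only), shown only to illustrate removing both kinds of error, not as a complete domain name validator.

```python
import re

# The pattern from the text: \w allows underscores but not hyphens.
LOOSE = re.compile(r"\w+\.com")

# Hypothetical refinement: letters and digits, with hyphens allowed only inside the label.
STRICTER = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.com")


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


for domain in ["abc_.com", "ab-c.com"]:
    print(domain, matches(LOOSE, domain), matches(STRICTER, domain))
```

The loose pattern accepts abc_.com (a false match) and rejects ab-c.com (a missing match); the stricter sketch reverses both results.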

Master the syntax details. Regular expression syntax is roughly the same across languages, but the details vary. Knowing the details of the regex syntax in the language you use is the foundation for writing correct and efficient regular expressions.

Rough first, then fine; add first, then subtract. Regular expression syntax is used to describe and delimit the target text, much like sketching: you can outline the framework first and then fill in the details step by step. Take the phone number example again: starting with \d{11} cannot go wrong, and refining it to 1[358]\d{9} is a big step forward (whether the second digit really is 3, 5 or 8 is not the point here; this is just one example of the progressive refinement process). The goal is first to eliminate missing matches (match as much as you can at the start, which is adding) and then to eliminate false matches bit by bit (subtracting). Working in this order makes mistakes less likely and moves you toward the goal of "no false matches and no missing matches".
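The refinement step can be sketched in Python as follows. `ROUGH`, `REFINED` and `check` are names chosen for illustration, and the digits 3, 5 and 8 are taken from the example rather than from any current numbering plan.

```python
import re

ROUGH = re.compile(r"\d{11}")          # first pass: any 11 digits, no missing matches
REFINED = re.compile(r"1[358]\d{9}")   # second pass: subtract obvious false matches


def check(number: str) -> tuple[bool, bool]:
    """Return (matches rough pattern, matches refined pattern) for the whole string."""
    return (
        ROUGH.fullmatch(number) is not None,
        REFINED.fullmatch(number) is not None,
    )


for number in ["13812345678", "98765432100", "1381234567"]:
    print(number, check(number))
```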

Leave some room. The text samples you can see are limited, while the text to be matched is massive and not yet visible. When writing regular expressions for such cases, it is important to think beyond the text in front of you and take a strategic view. For example, one often receives spam messages such as "send * tickets" and "send # drift". To write a rule that blocks such annoying messages, you should not only write a regular expression that matches the current text, but also anticipate possible "variants", along the lines of send.(tickets|drift|wave). Specific domains may have their own specific rules, which are not covered here. The purpose is to eliminate missing matches and extend the life cycle of the regular expression.

Be explicit. In particular, use metacharacters such as the dot sparingly, and avoid open-ended quantifiers such as the asterisk and plus sign as far as possible. If you can pin down the range, for example with \w, do not use the dot; if you can predict the number of repetitions, do not use an open-ended quantifier. For example, suppose you are writing a script to extract Twitter messages and the XML body of a message is structured as…; then matching the body with [^<]{1,480} is a better approach than .* for two reasons. First, [^<] ensures that the matched text does not run past the next less-than sign. Second, {1,480} makes the length range explicit, based on the approximate length of a Twitter message. Of course, whether 480 is the right length is debatable, but the idea is worth learning. To put it bluntly, "abusing the dot, asterisk and plus sign is environmentally irresponsible".
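The difference shows up as soon as two messages sit on one line. In the Python sketch below, the `<msg>` tag is a made-up stand-in (the actual tag is not given in the text above), and the function names are mine.

```python
import re

# Hypothetical markup: <msg> is an invented tag used only for this illustration.
SAMPLE = "<msg>first</msg><msg>second</msg>"

GREEDY_DOT = re.compile(r"<msg>(.*)</msg>")
BOUNDED_CLASS = re.compile(r"<msg>([^<]{1,480})</msg>")


def extract_with_dot(text: str) -> list[str]:
    """Extract bodies with .*, which runs on to the last closing tag."""
    return GREEDY_DOT.findall(text)


def extract_bounded(text: str) -> list[str]:
    """Extract bodies with [^<]{1,480}, which stops at the next less-than sign."""
    return BOUNDED_CLASS.findall(text)


print(extract_with_dot(SAMPLE))
print(extract_bounded(SAMPLE))
```

The dot version returns one oversized match spanning both messages, while the bounded version returns the two bodies separately.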

Don't let straw break the camel's back. Every time you use ordinary parentheses () instead of non-capturing parentheses (?:…), a portion of memory is set aside waiting for you to access the captured text again. Such a regular expression, run an unlimited number of times, is like straw piling up until it finally breaks the camel's back. Get into the habit of using (?:…) parentheses where appropriate.
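In Python, the number of capture groups a compiled pattern keeps is visible through its `groups` attribute, which makes the difference easy to see. This is a small sketch using the "industry" example from the metacharacter table; it shows what is stored, not how much time or memory is saved.

```python
import re

CAPTURING = re.compile(r"industr(y|ies)")
NON_CAPTURING = re.compile(r"industr(?:y|ies)")

# Both patterns match the same text...
word = "industries"
print(CAPTURING.fullmatch(word).group(0), NON_CAPTURING.fullmatch(word).group(0))

# ...but only the capturing version stores the sub-match for later access.
print(CAPTURING.groups, CAPTURING.fullmatch(word).groups())
print(NON_CAPTURING.groups, NON_CAPTURING.fullmatch(word).groups())
```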

Prefer simplicity to complexity. Splitting a complex regular expression into two or more simple ones reduces programming difficulty and improves running efficiency. For example, the regular expression s/^\s+|\s+$//g; used to remove leading and trailing whitespace is, in theory, less efficient than s/^\s+//g; s/\s+$//g;. This example comes from Chapter 5 of Mastering Regular Expressions, which describes the split version as "almost always the fastest, and apparently the easiest to understand." It is fast and easy to understand, so why not use it? At work there are other reasons to split a regular expression C == (A|B) into two expressions A and B that are run separately. For example, in both cases the match succeeds as long as one of the patterns hits the required text; but if even one subexpression (say A) produces false matches, then no matter how efficient and accurate the other subexpressions (say B) are, the overall accuracy of C is affected by A.
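A Python rendering of the two trimming approaches is sketched below. The function names are mine, and the sketch only shows that the two forms give the same result; it makes no timing claim for Python's engine.

```python
import re


def trim_combined(text: str) -> str:
    """One pass with an alternation, like s/^\\s+|\\s+$//g."""
    return re.sub(r"^\s+|\s+$", "", text)


def trim_split(text: str) -> str:
    """Two simple passes, like s/^\\s+//; s/\\s+$//."""
    without_leading = re.sub(r"^\s+", "", text)
    return re.sub(r"\s+$", "", without_leading)


sample = "  \t hi there \n"
print(repr(trim_combined(sample)), repr(trim_split(sample)))
```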

Position smartly. Sometimes we need to match "the" as a word (with spaces on both sides) rather than the sequence t-h-e as part of another word (the "the" in "together", for example). Using anchors such as ^, $ and \b at the right moments can effectively improve the efficiency of finding successful matches and ruling out unsuccessful ones.
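The word-boundary case can be sketched in Python like this; the sample sentence and function names are made up for illustration.

```python
import re

SENTENCE = "the cat got together"


def find_anywhere(text: str) -> list[str]:
    """Match the letters t-h-e wherever they occur."""
    return re.findall(r"the", text)


def find_whole_word(text: str) -> list[str]:
    """Match 'the' only when it stands alone as a word."""
    return re.findall(r"\bthe\b", text)


print(find_anywhere(SENTENCE))    # also hits the "the" inside "together"
print(find_whole_word(SENTENCE))  # only the standalone word
```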