Regular expressions (often abbreviated as regex, regexp, or RE) are one of the most commonly used techniques in Python, and a good regular expression can often replace dozens of lines of code. Typical uses include matching or validating email addresses, ID card numbers, mobile phone numbers, IP addresses, URLs, and HTML.

A regular expression is a special sequence of characters that defines a pattern (a set of rules) used to match and validate other strings (text, web pages, and so on).

The difficulty in mastering regular expressions, however, is that there is a lot of basic pattern syntax to memorize, and these basic patterns can be combined in endless variations.

Therefore, there is no need to memorize the basic pattern syntax up front. Look it up as you need it; with regular use it will stick naturally.

Basic pattern syntax

Character range matching

Regular expression  Description                                         Matches   Does not match
A                   Matches the single character A exactly              A         a
x|y                 Matches either of the two characters                y         n
[xyz]               Set: matches any single character listed            z         c
[a-z] [A-Z] [0-9]   Range: matches any single character in the range    a D 8     A a A
[^xyz] [^0-9]       Negated set: matches any character not listed       0 A       y 8
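
As a minimal sketch of the rows above, the following checks a few single-character patterns with re.fullmatch. The helper name matches is made up for this illustration and is not part of the re module.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'A', 'A'))       # True: the exact character
print(matches(r'A', 'a'))       # False: matching is case sensitive by default
print(matches(r'x|y', 'y'))     # True: either x or y
print(matches(r'[xyz]', 'c'))   # False: c is not in the set
print(matches(r'[a-z]', 'A'))   # False: A is outside the lowercase range
print(matches(r'[^0-9]', 'A'))  # True: any character that is not a digit
print(matches(r'[^0-9]', '8'))  # False: digits are excluded
```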

Metacharacters

Regular expression  Description                                         Matches   Does not match
\d                  Matches any single digit                            8         i
\D                  Matches any single character that is not a digit    i         8
\w                  Matches any single letter, digit, or underscore     Y         &
\W                  Matches any single character not matched by \w      &         Y
\s                  Matches a single whitespace character               (space)   x
\n                  Matches a single newline character                  (newline) x
.                   Matches any single character except a newline
\.                  Matches a literal dot only                          .         1
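
To make the metacharacters concrete, here is a small sketch that runs re.findall over a sample string. The sample text is invented for the example.

```python
import re

text = 'user_42 paid $3.50\n'  # made-up sample input

digits = re.findall(r'\d', text)      # every single digit
words = re.findall(r'\w+', text)      # runs of letters, digits, underscores
spaces = re.findall(r'\s', text)      # whitespace, including the newline
dots = re.findall(r'\.', text)        # literal dots only
any_char = re.findall(r'.', 'a\nb')   # . skips the newline by default

print(digits)    # ['4', '2', '3', '5', '0']
print(words)     # ['user_42', 'paid', '3', '50']
print(spaces)    # [' ', ' ', '\n']
print(dots)      # ['.']
print(any_char)  # ['a', 'b']
```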

Repetition matching

Regular expression  Description                                         Matches   Does not match
A{3}                Matches exactly N times (here 3)                    AAA       AA
A{3,}               Matches at least N times (here 3)                   AAA       AA
\d{3,5}             Matches at least 3 and at most 5 times              1234      12
\d*                 Matches zero or more times; same as {0,}            1234
\d+                 Matches one or more times; same as {1,}             12
\d?                 Matches at most once; same as {0,1}                 1         12
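
The quantifiers can be checked the same way. This is only a sketch using re.fullmatch, so each result says whether the entire string fits the pattern.

```python
import re

results = [
    bool(re.fullmatch(r'A{3}', 'AAA')),      # True: exactly three
    bool(re.fullmatch(r'A{3}', 'AA')),       # False: only two
    bool(re.fullmatch(r'A{3,}', 'AAAAA')),   # True: three or more
    bool(re.fullmatch(r'\d{3,5}', '1234')),  # True: between three and five digits
    bool(re.fullmatch(r'\d{3,5}', '12')),    # False: too short
    bool(re.fullmatch(r'\d*', '')),          # True: zero digits is allowed
    bool(re.fullmatch(r'\d+', '')),          # False: at least one digit is required
    bool(re.fullmatch(r'\d?', '12')),        # False: at most one digit
]
print(results)
```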

Position matching

Regular expression  Description                                         Matches   Does not match
^A.*                Matches at the start of the string                  ABC       CBA
.*A$                Matches at the end of the string                    CBA       ABC
^A.*A$              Matches the whole string (start and end)            ACCCA     ACCC
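
A short sketch of the anchors, run against the same sample strings as the table:

```python
import re

rows = []
for text in ['ABC', 'CBA', 'ACCCA', 'ACCC']:
    starts = bool(re.search(r'^A.*', text))   # begins with A
    ends = bool(re.search(r'.*A$', text))     # ends with A
    both = bool(re.search(r'^A.*A$', text))   # begins and ends with A
    rows.append((text, starts, ends, both))
    print(text, starts, ends, both)
```

Only ACCCA satisfies all three patterns, because ^A.*A$ needs an A at both ends.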

Regular expression matching process

import re

# Regular expression for password strength:
# at least one digit, one lowercase letter, one uppercase letter, 8 to 10 characters
re_password = re.compile(r'^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,10}$')
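
A compiled pattern can be reused for many inputs. The sketch below assumes the intended rule is "a digit, a lowercase letter, an uppercase letter, and a length of 8 to 10"; the candidate passwords are invented.

```python
import re

# Assumed intent: one digit, one lowercase letter, one uppercase letter, length 8 to 10.
re_password = re.compile(r'^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,10}$')

results = {}
for candidate in ['Passw0rd', 'password', 'Pw0', 'Passw0rd12345']:
    results[candidate] = re_password.match(candidate) is not None
    print(candidate, results[candidate])
```

Each (?=...) is a lookahead: it checks a condition without consuming characters, so the three conditions are tested independently before .{8,10} checks the length.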

When calling re.compile(pattern[, flags]), you can also pass regular expression modifiers (flags) to control how matching behaves.

Details are shown in the following table:

Modifier  Description
re.I      Makes matching case insensitive
re.L      Makes matching locale-aware
re.M      Multi-line matching; affects ^ and $
re.S      Makes . match every character, including newlines
re.U      Interprets characters according to the Unicode character set; affects \w, \W, \b, \B
re.X      Allows a more flexible (verbose) layout so that regular expressions are easier to read
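
The effect of the most common modifiers can be seen in a small sketch. The sample text and variable names are made up for the illustration.

```python
import re

text = 'first line\nSecond line'  # made-up two-line sample

default_start = re.findall(r'^\w+', text)               # ^ matches only at the very start
multi_start = re.findall(r'^\w+', text, re.M)           # re.M: ^ also matches after each newline
dot_default = re.findall(r'line.Second', text)          # . does not match the newline
dot_all = re.findall(r'line.Second', text, re.S)        # re.S: . matches the newline too
ignore_case = re.findall(r'second', text, re.I)         # re.I: case is ignored

# re.X: whitespace and # comments inside the pattern are ignored
verbose = re.compile(r'''
    (\d{4})   # year
    -
    (\d{2})   # month
''', re.X)
year_month = verbose.match('2020-01').groups()

print(default_start)  # ['first']
print(multi_start)    # ['first', 'Second']
print(dot_default)    # []
print(dot_all)        # ['line\nSecond']
print(ignore_case)    # ['Second']
print(year_month)     # ('2020', '01')
```

Several modifiers can be combined with the | operator, for example re.I | re.M.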

Matching

re.match(pattern, string, flags=0)

re.match tries to match only at the start of the string. If the pattern does not match at the start, it returns None, so check the result before using it.

flags is the regular expression modifier described above.

import re

print(re.match(r'hello', 'Hello world', re.I))
print(re.match(r'world', 'Hello world', re.I))

<_sre.SRE_Match object at 0x7f2c7c626648>
None
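
Because re.match can return None, calling .group() directly on the result may raise AttributeError. A minimal sketch of a safe check follows; the function name first_word is hypothetical.

```python
import re

def first_word(text):
    """Hypothetical helper: return the leading word of text, or None if there is none."""
    m = re.match(r'\w+', text)
    if m is None:          # no match at the start of the string
        return None
    return m.group(0)

print(first_word('Hello world'))  # Hello
print(first_word('  Hello'))      # None: the string starts with spaces
```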

re.search(pattern, string, flags=0)

Unlike the match method, it scans the entire string and returns the first successful match.

import re

print(re.search(r'hello', 'Hello world', re.I))
print(re.search(r'world', 'Hello world', re.I))

<_sre.SRE_Match object at 0x7fbcf9945648>
<_sre.SRE_Match object at 0x7fbcf9945648>

As you can see, the search method successfully found the world string.

To get the matched text itself, use group and groups.

import re

print(re.search(r'world', 'Hello world', re.I).group(0))
print(re.search(r'(world)', 'Hello world', re.I).groups())

world
('world',)

Note that parentheses () define capture groups in a regular expression.
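
With more than one pair of parentheses, each group can be read by its number. The sketch below uses an invented sample sentence.

```python
import re

m = re.search(r'(\d{4})-(\d{2})-(\d{2})', 'Released on 2020-01-01.')

whole = m.group(0)      # the whole match
year = m.group(1)       # the first group
rest = m.group(2, 3)    # several groups at once, returned as a tuple
all_groups = m.groups() # every group, in order

print(whole)       # 2020-01-01
print(year)        # 2020
print(rest)        # ('01', '01')
print(all_groups)  # ('2020', '01', '01')
```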

One more thing to note is that regular expression matching is greedy by default.

import re

print(re.match(r'^(\d+)(0*)$', '102300').groups())
print(re.match(r'^(\d+?)(0*)$', '102300').groups())

('102300', '')
('1023', '00')

Because \d+ is greedy, it consumes all the trailing zeros as well, so 0* can only match the empty string.

To let 0* match the trailing zeros, \d+ must be made non-greedy (that is, it should match as few characters as possible). Appending ? to \d+ does this.
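
The same difference shows up whenever a pattern has a closing delimiter. A small sketch with an invented HTML fragment:

```python
import re

html = '<b>one</b><b>two</b>'  # made-up sample

greedy = re.findall(r'<b>(.*)</b>', html)   # .* runs to the last </b>
lazy = re.findall(r'<b>(.*?)</b>', html)    # .*? stops at the first </b>

print(greedy)  # ['one</b><b>two']
print(lazy)    # ['one', 'two']
```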

match and search return only a single match. To find every match, use findall and finditer.

pattern.findall(string[, pos[, endpos]])

import re
 
# match number
pattern = re.compile(r'\d+')
result1 = pattern.findall('runoob 123 google 456')
result2 = pattern.findall('run88oob123google456', 0, 10)  # search only string[0:10]
 
print(result1)
print(result2)

['123', '456']
['88', '12']

re.finditer(pattern, string, flags=0)

import re

# match number
it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())

12 
32 
43 
3
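
One difference worth seeing side by side: when the pattern contains groups, findall returns tuples of the group values, while finditer yields match objects that also carry positions. A sketch with an invented input string:

```python
import re

text = 'a=1, b=22'  # made-up sample

pairs = re.findall(r'(\w+)=(\d+)', text)  # list of (group 1, group 2) tuples
print(pairs)  # [('a', '1'), ('b', '22')]

spans = []
for m in re.finditer(r'(\w+)=(\d+)', text):
    spans.append((m.group(1), m.group(2), m.span()))  # span() is (start, end)
    print(m.group(1), m.group(2), m.span())
```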

Besides pure matching, you will often need to split and replace strings, so here are two more methods:

re.split(pattern, string[, maxsplit=0, flags=0])

import re

print(re.split(r'\W+', 'runoob, runoob, runoob.'))

['runoob', 'runoob', 'runoob', '']
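
The trailing empty string appears because the final "." is also a separator. A short sketch of a few common variations (the variable names are made up):

```python
import re

text = 'runoob, runoob, runoob.'

non_empty = [part for part in re.split(r'\W+', text) if part]  # drop empty pieces
first_only = re.split(r'\W+', text, maxsplit=1)                # split at most once
with_seps = re.split(r'(\W+)', 'a, b')                         # a group keeps the separators

print(non_empty)   # ['runoob', 'runoob', 'runoob']
print(first_only)  # ['runoob', 'runoob, runoob.']
print(with_seps)   # ['a', ', ', 'b']
```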

re.sub(pattern, repl, string, count=0, flags=0)

import re

dt = '2020-01-01'
print(re.sub(r'\D', ' ', dt))  # replace every non-digit with a space

2020 01 01
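
re.sub also accepts group references, a count limit, and a function as the replacement. The following is a sketch of those options, not an exhaustive list:

```python
import re

dt = '2020-01-01'

reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', dt)  # reuse groups by number
stripped = re.sub(r'\D', '', dt)                                   # delete non-digits
first_only = re.sub(r'\D', '/', dt, count=1)                       # replace only the first match
bumped = re.sub(r'\d+', lambda m: str(int(m.group()) + 1), 'v1.9') # compute each replacement

print(reordered)   # 01/01/2020
print(stripped)    # 20200101
print(first_only)  # 2020/01-01
print(bumped)      # v2.10
```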

Common regular expressions

  1. Validate password strength:

    ^(?=.*\\d)(?=.*[a-z])(?=.*[A-Z]).{8,10}$

  2. Validate Chinese characters:

    ^[\\u4e00-\\u9fa5]{0,}$

  3. A string of digits, the 26 English letters, or underscores:

    ^\\w+$

  4. Validate email address:

    [\\w!#$%&'*+/=?^_`{|}~-]+(?:\\.[\\w!#$%&'*+/=?^_`{|}~-]+)*@(?:[\\w](?:[\\w-]*[\\w])?\\.)+[\\w](?:[\\w-]*[\\w])?

  5. Validate ID card number:

    15 digits:

    ^[1-9]\\d{7}((0\\d)|(1[0-2]))(([0|1|2]\\d)|3[0-1])\\d{3}$

    18 digits:

    ^[1-9]\\d{5}[1-9]\\d{3}((0\\d)|(1[0-2]))(([0|1|2]\\d)|3[0-1])\\d{3}([0-9]|X)$

  6. Validate mobile phone number:

    ^(13[0-9]|14[5|7]|15[0|1|2|3|5|6|7|8|9]|18[0|1|2|3|5|6|7|8|9])\\d{8}$

  7. IP address:

    v4:

    \\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\b

    v6:

    (([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))

  8. Extract page hyperlinks (the pattern is assembled in PHP: implode() joins the domains in $follow_list):

    (<a\\s*(?!.*\\brel=)[^>]*)(href="https?:\\/\\/)((?!(?:(?:www\\.)?'.implode('|(?:www\\.)?', $follow_list).'))[^"]+)"((?!.*\\brel=)[^>]*)(?:[^>]*)>

  9. Validate date (yyyy-mm-dd format):

    ^(?:(?!0000)[0-9]{4}-(?:(?:0[1-9]|1[0-2])-(?:0[1-9]|1[0-9]|2[0-8])|(?:0[13-9]|1[0-2])-(?:29|30)|(?:0[13578]|1[02])-31)|(?:[0-9]{2}(?:0[48]|[2468][048]|[13579][26])|(?:0[48]|[2468][048]|[13579][26])00)-02-29)$

  10. Validate amount (two decimal places):

    ^[0-9]+(\\.[0-9]{2})?$
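
The patterns in this list are written with doubled backslashes, as they would appear inside an ordinary (non-raw) string literal. In a Python raw string each \\ is written as a single \. The sketch below tries three of them; the constant names are made up, and the amount pattern escapes the dot on the assumption that a literal decimal point is intended.

```python
import re

# Patterns from the list, with each doubled backslash written once for a raw string.
IPV4 = re.compile(
    r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'
    r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'
)
DATE = re.compile(
    r'^(?:(?!0000)[0-9]{4}-(?:(?:0[1-9]|1[0-2])-(?:0[1-9]|1[0-9]|2[0-8])'
    r'|(?:0[13-9]|1[0-2])-(?:29|30)|(?:0[13578]|1[02])-31)'
    r'|(?:[0-9]{2}(?:0[48]|[2468][048]|[13579][26])'
    r'|(?:0[48]|[2468][048]|[13579][26])00)-02-29)$'
)
AMOUNT = re.compile(r'^[0-9]+(\.[0-9]{2})?$')  # assumption: the dot is a literal decimal point

found_ip = IPV4.search('host 192.168.1.10 is up').group()
bad_ip = IPV4.fullmatch('256.1.1.1') is not None
leap_day = DATE.match('2020-02-29') is not None       # 2020 is a leap year
not_leap_day = DATE.match('2021-02-29') is not None   # 2021 is not
amount_ok = AMOUNT.match('12.50') is not None
amount_bad = AMOUNT.match('12.5') is not None

print(found_ip)      # 192.168.1.10
print(bad_ip)        # False
print(leap_day)      # True
print(not_leap_day)  # False
print(amount_ok)     # True
print(amount_bad)    # False
```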


Finally, we have a book titled “Understanding NLP Chinese Word Segmentation: From Principle to Practice”, which can help you learn Chinese word segmentation from scratch and get started with NLP.

If you found this article helpful, please like, share, and comment.