Regular expressions (regular, regular, and RE) are the most common programming tricks in Python, and many times a good regular expression is worth dozens of lines of code. For example, match (verify) the mailbox, ID card, mobile phone number, IP address, URL, and HTML.
A regular expression is a special sequence of characters that contains predefined patterns (rules) that can be used to match and verify other strings (text, web pages, etc.).
The difficulty of mastering regular expressions, however, is that they contain a lot of basic pattern syntax to memorize, and these basic pattern syntax can be combined to produce infinite variations.
Therefore, it is not recommended to memorize the basic pattern of regular grammar, can be used as you look up, use more, will naturally form a mechanical memory.
Basic schema syntax
Character range matching
Regular expression | instructions | correct | error |
---|---|---|---|
A | Matches a single character exactly | A | a |
x|y | Two characters allowed | y | n |
[xyz] | A collection of characters that allows any single character in the collection to appear | z | c |
[a-z] [A-Z] [0-9] | Range of characters | a D 8 | A a A |
[^xyz] [^0-9] | Characters in collection are not allowed | 0 A | y 8 |
metacharacters
Regular expression | instructions | correct | error |
---|---|---|---|
\d | Matches any single number | 8 | i |
\D | Matches any single character other than the \d rule | i | 8 |
\w | Matches any single alphanumeric underscore | Y | & |
\W | Matches any single character other than \w | & | Y |
\s | Matching a single space | x | |
\n | Matches a single newline character | x | |
. | Matches any single character (except newline) | — | — |
\. | Special characters are matched only. | . | 1 |
Multiple repeat matching
Regular expression | instructions | correct | error |
---|---|---|---|
A{3} | Exact N times of matching | AAA | AA |
A{3,} | At least N times | AAA | AA |
\ d {3, 5} | Specify the minimum and maximum number of occurrences | 1234 | 12 |
\d* | Can occur zero to infinite times, equivalent to {0,} | 1234 | — |
\d+ | At least once, equivalent to {1,} | 12 | |
\d? | Occurs at most once, equivalent to {0,1} | 1 | 12 |
Locate matching
Regular expression | instructions | correct | error |
---|---|---|---|
^A.* | Head match | ABC | CBA |
.*A$ | End of match | CBA | ABC |
^A.*A$ | Full word horse | ACCCA | ACCC |
Regular matching process
import re
# regular expression for password strength
re_password = re.compile(r'^(? =.*\\d)(? =.*[a-z])(? =. * [a-z]). 8, 10 {} $')
Copy the code
At the same time, when re.compiler(pattern[, flags]), you can select regular expression modifiers to control the matching pattern.
Details are shown in the following table:
The modifier | describe |
---|---|
re.I | Make the match case insensitive |
re.L | Do location-aware matching |
re.M | Multi-line matching, affecting ^ and $ |
re.S | Causes. To match all characters, including newlines |
re.U | Resolves characters according to the Unicode character set. This sign affects \W, \w, \b, \b |
re.X | This flag allows you to make your regular expressions easier to understand by giving you a more flexible format |
matching
re.match(pattern, string, flags=0)
Matches from the start position of the string, or returns None if the start position does not match, which requires special attention.
Flags is the modifier of the regular expression.
import re
print(re.match(r'hello'.'Hello world', re.I))
print(re.match(r'world'.'Hello world', re.I))
<_sre.SRE_Match object at 0x7f2c7c626648>
None
Copy the code
re.search(pattern, string, flags=0)
Unlike the match method, it scans the entire string and returns the first successful match.
import re
print(re.search(r'hello'.'Hello world', re.I))
print(re.search(r'world'.'Hello world', re.I))
<_sre.SRE_Match object at 0x7fbcf9945648>
<_sre.SRE_Match object at 0x7fbcf9945648>
Copy the code
As you can see, the search method successfully found the world string.
If we want to output matching results, we can use groups and groups.
import re
print(re.search(r'world'.'Hello world', re.I).group(0))
print(re.search(r'(world)'.'Hello world', re.I).groups())
world
('world'.)Copy the code
One idea is that () applies to groups in regular expressions.
One more thing to note is that regular expression matches default to greedy matches.
import re
print(re.match(r'^(\d+)(0*)$'.'102300').groups())
print(re.match(r'^(\d+?) (0 *) $'.'102300').groups())
('102300'.' ')
('1023'.'00')
Copy the code
\d+ = 0* = 0* = 0* = 0* = 0* = 0*
You must make \d+ use non-greedy matches (that is, as few matches as possible) in order to match the following zeros and add? Can be used by \d+.
Match and search can only match once, but if you want to match all, you can use findAll and finditer.
findall(string[, pos[, endpos]]
import re
# match number
pattern = re.compile(r'\d+')
result1 = pattern.findall('runoob 123 google 456')
result2 = pattern.findall('run88oob123google456'.0.10)
print(result1)
print(result2)
['123'.'456']
['88'.'12']
Copy the code
finditer(pattern, string, flags=0)
import re
# match number
it = re.finditer(r"\d+"."12a32bc43jf3")
for match in it:
print (match.group())
12
32
43
3
Copy the code
In addition to pure matching, there will be a need to split and replace, so here are two methods:
re.split(pattern, string[, maxsplit=0, flags=0])
import re
print(re.split('\W+'.'runoob, runoob, runoob.'))'runoob'.'runoob'.'runoob'.' ']
Copy the code
re.sub(pattern, repl, string, count=0, flags=0)
import re
dt = '2020-01-01'
print(re.sub(r'\D'.' ', dt))
2020 01 01
Copy the code
Common regular expressions
-
Verification password strength:
^ (? =.*\\d)(? =.*[a-z])(? =. * [a-z]). 8, 10 {} $
-
Calibration Chinese:
^[\\u4e00-\\u9fa5]{0,}$
-
A string of numbers, 26 letters, or underscores:
^\\w+$
-
Verification Email Address:
[\\w!#$%&'*+/=?^_`{|}~-]+(? :\\.[\\w!#$%&'*+/=?^_`{|}~-]+)*@(? :[\\w](? :[\\w-]*[\\w])? \ \.) +[\\w](? :[\\w-]*[\\w])?
-
Check ID card number
15:
^[1-9]\\d{7}((0\\d)|(1[0-2]))(([0|1|2]\\d)|3[0-1])\\d{3}$
18:
^[1-9]\\d{5}[1-9]\\d{3}((0\\d)|(1[0-2]))(([0|1|2]\\d)|3[0-1])\\d{3}([0-9]|X)$
-
Verification Mobile phone Number:
^(13[0-9]|14[5|7]|15[0|1|2|3|5|6|7|8|9]|18[0|1|2|3|5|6|7|8|9])\\d{8}$
-
IP address:
v4:
\\b(? : (? : 25 [0 to 5] | 2 [0 to 4] [0-9] | [01]? [0-9] [0-9]?) \ \.) {3} (? : 25 [0 to 5] | 2 [0 to 4] [0-9] | [01]? [0-9] [0-9]?) \\b
v6:
(([0-9 a - fA - F] {1, 4} {7, 7}) [0-9 a - fA - F] {1, 4} | ([0-9 a - fA - F] {1, 4} {1, 7}) : | ([0-9 a - fA - F] {1, 4} {1, 6}) : [0-9 a - fA - F] {1, 4} | ([0-9 - fA - a F] {1, 4} {1, 5}) (: [0-9 a - fA - F] {1, 4}, {1, 2} | ([0-9] a - fA - F {1, 4} {1, 4}) (: [0-9 a - fA - F] {1, 4}, {1, 3} | ([0-9 a - fA - F] {1, 4} {1, 3}) (: [0-9 - a FA - F] {1, 4} {1, 4} | ([0-9 a - fA - F] {1, 4}} {1, 2) (: [0-9 a - fA - F] {1, 4}, {1, 5} | [0-9 a - fA - F] {1, 4} : ((: [0-9 a - fA - F] {1, 4}, {1, 6}) | : ((: [0-9 - a FA - F] {1, 4} {1, 7} | :) | fe80: : [0-9 a - fA - F] {0, 4}) {0, 4} % [0-9 a zA - Z] {1} | : : (FFFF (: 0 {1, 4}) {0, 1} {0, 1} ((25) [0 to 5) | (2 [0 to 4] | 1 {0, 1} [0 9]) {0, 1} [0-9]) \ \.) {3} (25 [0 to 5) | (2 [0 to 4] | 1 {0, 1} [0-9]) {0, 1} [0-9]) | ([0-9 a - fA - F] {1, 4} {1, 4}) : ((25 [0 to 5) | (2 [0 to 4] | 1 {0, 1} [0-9]) {0, 1} [0-9]) \ \.) {3} (25 [0 to 5) | (2 | 1 [0-4] [0-9] {0, 1}) [0-9] {0, 1}))
-
Extract page hyperlinks:
(<a\\s*(? ! .*\\brel=)[^>]*)(href="https? : \ \ / \ \ / ((?) ! (? : (? :www\\.) ? '.implode('|(? :www\\.) ? ', $follow_list).'))[^"]+)"((? ! .*\\brel=)[^>]*)(? : [^ >] *) >
-
Calibration date:
^ (? : (? ! 0000) [0-9] {4} - (? : (? : 0 | [1-9] [0-2] 1) - (? : 0 [1-9] [0-9] | | 1 2 [0 to 8]) | (? : 0 [9] 13 - | [0-2] 1) - (? 30) : 29 | | (? : 0 [13578] 1 [02]) - 31) | | (? : [0-9] {2} (? : 0 [48] | [2468] [048] | [13579] [26]) | (? : 0 [48] | [2468] [048] | [13579] [26]) 00) - 02 - $29)
-
Check amount:
^ [0-9] + (. [0-9] {2})? $
(Double click to copy directly for easy use)
Finally, we have a book entitled “Understanding NLP Chinese Word Segmentation: From Principle to Practice”, which will help you master Chinese word segmentation from scratch and step into the door of NLP.
If the above content is helpful to you, I hope you can help me point a like, transfer a hair, comment.