I. Summary of common matching rules:

Pattern Description
\w Matches a letter, digit, or underscore
\W Matches any character that is not a letter, digit, or underscore
\s Matches any whitespace character, equivalent to [ \t\n\r\f\v]
\S Matches any non-whitespace character
\d Matches any digit, equivalent to [0-9] for ASCII text
\D Matches any non-digit
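\A Matches the start of the string
\Z Matches the end of the string. In Python's re this is the very end; in some other flavors (such as Perl) it also matches just before a trailing newline
\z Matches the very end of the string (in flavors such as Perl; Python's re uses \Z for this)
\G Matches the position where the previous match ended (not supported by Python's re)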
\n Matches a newline character
\t Matches a TAB character
^ Matches the beginning of the string
$ Matches the end of the string
. Matches any character except newline. When the re.DOTALL flag is specified, any character including newline can be matched
[...] Matches any one character in the set, e.g. [amk] matches 'a', 'm', or 'k'
[^...] Matches any one character not in the set, e.g. [^abc] matches any character other than a, b, and c
* Matches zero or more of the preceding expression
+ Matches one or more of the preceding expression
? Matches zero or one of the preceding expression; placed after another quantifier, it makes that quantifier non-greedy
{n} Matches exactly n of the preceding expression
{n,m} Matches the preceding expression n to m times, greedily
a|b Matches a or b
( ) Matches the expression in parentheses and captures it as a group
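
A few of the rules above can be tried out in a short sketch. The sample string and variable names here are made up for illustration and are not part of the original article.

```python
import re

# Hypothetical sample text, chosen only to exercise the rules above
sample = 'user_42 paid $3.50 on 2016-12-15'

# \w+ grabs runs of letters, digits, and underscores
words = re.findall(r'\w+', sample)

# \d+ grabs runs of digits only
numbers = re.findall(r'\d+', sample)

# \s matches a single whitespace character, so it can be used to split fields
fields = re.split(r'\s', sample)

# [^...] negates a set: everything that is neither a word character nor whitespace
others = re.findall(r'[^\w\s]', sample)

# {n} repeats exactly n times; ( ) captures each part as a group
date = re.search(r'(\d{4})-(\d{2})-(\d{2})', sample)

print(words)
print(numbers)
print(fields)
print(others)
print(date.groups())
```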

II. Universal matching

The . (dot) matches any character except a newline, and the * (star) matches the preceding expression zero or more times. Combined, .* matches any run of characters, so you do not have to spell out each character to match.

Greedy matching: .* matches as many characters as possible, as the following example shows:


import re

content = 'Hello 1234567 World_This is a Regex Demo'

result = re.match(r'^He.*(\d+).*Demo$', content)

print(result.group(1))

# Output: 7
# In '^He.*(\d+).*Demo$' the greedy .* matches as much as it can ("llo 123456") and leaves \d+ only the single digit 7, which is still enough for the whole pattern to match.

Non-greedy matching: to capture 1234567, use .*?, the non-greedy form of the pattern, as in the following example:


import re

content = 'Hello 1234567 World_This is a Regex Demo'

result = re.match(r'^He.*?(\d+).*Demo$', content)

print(result.group(1))

# Output: 1234567


Greedy matching consumes as many characters as possible; non-greedy matching consumes as few as possible. Here .*? is followed by \d+, which matches digits. Once .*? has consumed "llo " (everything between "He" and the first digit), the next character is a digit, so .*? stops and leaves the digits that follow to \d+. Because .*? matches as few characters as possible, \d+ captures the full 1234567.
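
To see exactly what each wildcard consumed, one option is to wrap it in its own group. This is a small sketch that reuses the article's sample string; the extra group and the variable names greedy and lazy are additions for illustration.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'

# Same two patterns, with an extra group around the wildcard
greedy = re.match(r'^He(.*)(\d+).*Demo$', content)
lazy = re.match(r'^He(.*?)(\d+).*Demo$', content)

# Group 1 is what the wildcard consumed, group 2 is what was left for \d+
print(greedy.groups())  # ('llo 123456', '7')
print(lazy.groups())    # ('llo ', '1234567')
```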

III. Modifiers

Modifier Description
re.I Makes matching case-insensitive
re.L Makes matching locale-aware
re.M Multi-line matching; affects ^ and $
re.S Makes . match every character, including newlines
re.U Interprets characters according to the Unicode character set; affects \w, \W, \b, and \B
re.X Allows a more flexible pattern layout so that regular expressions are easier to read

re.S and re.I are the ones most commonly used when matching web pages.
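
The effect of a few of these modifiers can be sketched as follows. The sample text and the date pattern are illustrative assumptions, not taken from the article.

```python
import re

# Hypothetical two-line sample text
text = 'first line\nSecond LINE'

# re.I: ignore case
print(re.findall(r'line', text, re.I))    # ['line', 'LINE']

# re.M: ^ also matches at the start of each line
print(re.findall(r'^\w+', text))          # ['first']
print(re.findall(r'^\w+', text, re.M))    # ['first', 'Second']

# re.X: whitespace and comments inside the pattern are ignored
date_pattern = re.compile(r'''
    (\d{4})  # year
    -
    (\d{2})  # month
    -
    (\d{2})  # day
''', re.X)
print(date_pattern.search('on 2016-12-15').groups())  # ('2016', '12', '15')

# Modifiers can be combined with |
print(re.findall(r'^s\w+', text, re.I | re.M))  # ['Second']
```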

Example:


import re

content = '''Hello 1234567 World_This
is a Regex Demo'''

result = re.match(r'^He.*?(\d+).*?Demo$', content)

print(result.group(1))

# Raises AttributeError: 'NoneType' object has no attribute 'group'


The text to match contains a newline, and . matches any character except a newline, so the match fails and re.match() returns None. Adding the re.S modifier fixes the error.


result = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)

# result.group(1) is now 1234567
# Passing re.S as the third argument to match() makes . match every character, including newlines.

IV. re library functions

  • re.match(): tries to match from the very beginning of the string; if the start of the string does not fit the pattern, it returns None;

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'

result = re.match(r'Hello.*?(\d+).*?Demo', content)

print(result)

# Output: None

  • re.search(): scans the entire string and returns the first successful match;

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'

result = re.search(r'Hello.*?(\d+).*?Demo', content)

print(result)

# Output: <_sre.SRE_Match object; span=(13, 53), match='Hello 1234567 World_This is a Regex Demo'>
# (Python 3.7 and later print re.Match instead of _sre.SRE_Match)

  • re.findall(): scans the entire string and returns a list of everything that matches the regular expression;
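
A minimal sketch of re.findall(), reusing the sample string from the re.search() example; the patterns themselves are chosen only for illustration.

```python
import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'

# Without groups, findall() returns the matched substrings
print(re.findall(r'\d+', content))          # ['1234567']
print(re.findall(r'Extra \w+', content))    # ['Extra stings', 'Extra stings']

# With several groups, it returns one tuple of captured groups per match
print(re.findall(r'(\w+)_(\w+)', content))  # [('World', 'This')]
```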

  • re.sub(): finds every match in the string and replaces it;


import re

content = '54aK54yr5oiR54ix5L2g'

content = re.sub(r'\d+', '', content)

print(content)

# Output: aKyroiRixLg
# re.sub(pattern, replacement, string)
  • re.compile(): compiles a regular expression string into a regular expression object that can be reused in later matches;

import re

content1 = '2016-12-15 12:00'

content2 = '2016-12-17 12:55'

content3 = '2016-12-22 13:21'

pattern = re.compile(r'\d{2}:\d{2}')

result1 = re.sub(pattern, '', content1)

result2 = re.sub(pattern, '', content2)

result3 = re.sub(pattern, '', content3)

print(result1, result2, result3)

# Output: 2016-12-15  2016-12-17  2016-12-22 (each result keeps the space that preceded the time)

# compile() turns the regular expression string into the regular expression object pattern, which later calls can use directly
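
A compiled pattern object also has its own search() and sub() methods, so it can be called directly instead of being passed to re.sub(). The sketch below assumes three illustrative date strings, which are sample data only.

```python
import re

pattern = re.compile(r'\d{2}:\d{2}')

# Illustrative sample data
contents = ['2016-12-15 12:00', '2016-12-17 12:55', '2016-12-22 13:21']

# pattern.search() and pattern.sub() take the same arguments as
# re.search() and re.sub(), minus the pattern itself
times = [pattern.search(c).group() for c in contents]
dates = [pattern.sub('', c).strip() for c in contents]

print(times)  # ['12:00', '12:55', '13:21']
print(dates)  # ['2016-12-15', '2016-12-17', '2016-12-22']
```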


References:

Regular expressions: germey.gitbooks.io/python3webs…

Online regular expression tests: tool.oschina.net/regex/#