I. Summary of common matching rules:
Pattern | Description |
---|---|
`\w` | Matches letters, digits, and the underscore |
`\W` | Matches any character that is not a letter, digit, or underscore |
`\s` | Matches any whitespace character, equivalent to [ \t\n\r\f\v] for ASCII text |
`\S` | Matches any non-whitespace character |
`\d` | Matches any digit, equivalent to [0-9] |
`\D` | Matches any non-digit character |
`\A` | Matches at the start of the string |
`\Z` | Matches at the end of the string; if the string ends with a newline, matches just before that newline (Perl-style engines; in Python's re, \Z matches only at the very end) |
`\z` | Matches at the very end of the string |
`\G` | Matches at the position where the previous match ended (not supported by Python's re) |
`\n` | Matches a newline character |
`\t` | Matches a tab character |
`^` | Matches the beginning of the string |
`$` | Matches the end of the string |
`.` | Matches any character except a newline; with the re.DOTALL flag it matches any character, including a newline |
`[...]` | Matches any one character from the set: [amk] matches 'a', 'm', or 'k' |
`[^...]` | Matches any one character not in the set: [^abc] matches any character other than a, b, and c |
`*` | Matches zero or more of the preceding expression |
`+` | Matches one or more of the preceding expression |
`?` | Matches zero or one of the preceding expression; placed after another quantifier (as in *? or +?) it makes that quantifier non-greedy |
`{n}` | Matches exactly n of the preceding expression |
`{n,m}` | Matches n to m of the preceding expression, greedily |
`a\|b` | Matches a or b |
`(...)` | Groups the enclosed expression and captures what it matches |
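A few of the rules in the table can be tried out directly. The following is a minimal sketch; the sample strings and variable names are made up for illustration and are not part of the original article.

```python
import re

# Hypothetical sample text, chosen only to exercise the rules above
text = "Order 66 shipped to user_42 on 2016-12-15\tOK"

numbers = re.findall(r"\d+", text)    # \d+   : runs of one or more digits
words = re.findall(r"\w+", text)      # \w+   : runs of letters, digits, underscores
years = re.findall(r"\d{4}", text)    # \d{4} : exactly four digits
parts = re.split(r"\s+", text)        # \s+   : split on spaces and the tab

in_set = re.findall(r"[amk]", "make")       # any one of a, m, k
not_in_set = re.findall(r"[^abc]", "aXbYc") # anything except a, b, c
either = re.findall(r"cat|dog", "cats and dog")  # alternation

print(numbers)     # ['66', '42', '2016', '12', '15']
print(years)       # ['2016']
print(parts)       # ['Order', '66', 'shipped', 'to', 'user_42', 'on', '2016-12-15', 'OK']
print(in_set)      # ['m', 'a', 'k']
print(not_in_set)  # ['X', 'Y']
print(either)      # ['cat', 'dog']
```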
II. Universal matching
The . (dot) matches any character except a newline, and the * (star) repeats the preceding expression zero or more times. Combined, .* matches any run of characters, so you do not have to spell out each character individually.
Greedy matching: .* matches as many characters as possible, as the following example shows:
import re
content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result.group(1))
# 7
# In '^He.*(\d+).*Demo$', the greedy .* consumes as much as it can ('llo 123456') while still letting the pattern match, so only '7' is left for \d+
**Non-greedy matching:** To capture the whole number 1234567, use the non-greedy pattern .*? instead. For example:
import re
content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result.group(1))
# 1234567
Greedy matching matches as many characters as possible; non-greedy matching matches as few as possible. Here .*? is followed by \d+, which matches digits. Once .*? has consumed the text up to the space after "Hello", the next character is a digit, so .*? stops and leaves the digits to \d+. Because .*? takes as few characters as possible, \d+ captures 1234567.
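The two behaviours can be placed side by side on the same input. This is a small sketch; the variable names are illustrative. The last two lines show a related caveat: a non-greedy quantifier at the very end of a pattern matches the minimum it is allowed to, which may be a single character or nothing at all.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'

# Same pattern, differing only in .* versus .*?
greedy = re.match(r'^He.*(\d+).*Demo$', content)
lazy = re.match(r'^He.*?(\d+).*Demo$', content)

print(greedy.group(1))  # 7
print(lazy.group(1))    # 1234567

# With nothing after it, a non-greedy quantifier stops at its minimum
one_digit = re.match(r'Hello (\d+?)', content).group(1)  # '1'
nothing = re.match(r'Hello (.*?)', content).group(1)     # ''
print(repr(one_digit), repr(nothing))
```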
III. Modifiers
- re.I: makes matching case-insensitive.
- re.L: makes matching locale-aware.
- re.M: multi-line matching; affects ^ and $.
- re.S: makes . match every character, including newlines.
- re.U: interprets characters according to the Unicode character set; affects \w, \W, \b, and \B.
- re.X: allows a more flexible layout (whitespace and comments) so that regular expressions are easier to read.

re.S and re.I are the ones most commonly used when matching web pages.
Example:
import re
content = 'Hello 1234567 World_This\nis a Regex Demo'
result = re.match(r'^He.*?(\d+).*?Demo$', content)
print(result.group(1))
# AttributeError: 'NoneType' object has no attribute 'group'
The text contains a newline, and . matches any character except a newline, so the match fails and re.match() returns None. Adding the re.S modifier fixes the error:
result = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)
print(result.group(1))
# 1234567
# Passing re.S as the third argument to match() makes . match every character, including newlines.
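The other modifiers can be sketched in the same way. The sample text and names below are assumptions made for illustration; the example shows re.I, re.M, re.X, and how flags are combined with the | operator.

```python
import re

# Hypothetical two-line sample text
text = "hello world\nHello Regex"

# re.I: 'hello' also matches 'Hello'
case_insensitive = re.findall(r"hello", text, re.I)

# re.M: ^ matches at the start of every line, not only the start of the string
first_words_multiline = re.findall(r"^\w+", text, re.M)
first_words_default = re.findall(r"^\w+", text)

# re.X: whitespace and # comments inside the pattern are ignored
date_pattern = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.X)
date_parts = date_pattern.search("released on 2016-12-15").groups()

# Flags are combined with |
combined = re.search(r"^hello.*regex$", text, re.I | re.S)

print(case_insensitive)       # ['hello', 'Hello']
print(first_words_multiline)  # ['hello', 'Hello']
print(first_words_default)    # ['hello']
print(date_parts)             # ('2016', '12', '15')
print(combined is not None)   # True
```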
IV. re library functions
- re.match(): tries to match from the beginning of the string and returns None if the pattern does not match there;
import re
content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
result = re.match(r'Hello.*?(\d+).*?Demo', content)
print(result)
# None
- re.search(): scans the entire string and returns the first successful match;
import re
content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
result = re.search(r'Hello.*?(\d+).*?Demo', content)
print(result)
# <_sre.SRE_Match object; span=(13, 53), match='Hello 1234567 World_This is a Regex Demo'>
# (Python 3.7 and later print this as <re.Match object; ...>)
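Because both functions return None when nothing matches, it is safer to check the result before calling group(). A minimal sketch follows; first_number is a hypothetical helper introduced only for this illustration.

```python
import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
pattern = r'Hello.*?(\d+).*?Demo'

def first_number(func, text):
    """Hypothetical helper: apply re.match or re.search and return group 1, or None."""
    result = func(pattern, text)
    return result.group(1) if result else None

print(first_number(re.match, content))   # None (the string does not start with 'Hello')
print(first_number(re.search, content))  # 1234567
```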
- re.findall(): scans the entire string and returns every substring that matches the regular expression;
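A short sketch of re.findall() follows; the sample string is made up for illustration. Without groups it returns a list of matched strings, with groups it returns a list of tuples, and with no match it returns an empty list.

```python
import re

# Hypothetical sample text
content = 'the 2016-12-15 12:00, the 2016-12-17 12:55'

# No groups: a list of the matched substrings
times = re.findall(r'\d{2}:\d{2}', content)

# Several groups: a list of tuples, one tuple per match
dates = re.findall(r'(\d{4})-(\d{2})-(\d{2})', content)

# No match: an empty list rather than None
missing = re.findall(r'[A-Z]+', content)

print(times)    # ['12:00', '12:55']
print(dates)    # [('2016', '12', '15'), ('2016', '12', '17')]
print(missing)  # []
```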
- re.sub(): finds every match in the string and replaces it;
import re
content = '54aK54yr5oiR54ix5L2g'
content = re.sub(r'\d+', '', content)
print(content)
# Result: aKyroiRixLg
# re.sub(pattern, replacement, text)
- re.compile(): compiles a regular expression string into a pattern object for reuse in later matches;
import re
content1 = '2016-12-15 12:00'
content2 = '2016-12-17 12:55'
content3 = '2016-12-22 13:21'
pattern = re.compile(r'\d{2}:\d{2}')
result1 = re.sub(pattern, '', content1)
result2 = re.sub(pattern, '', content2)
result3 = re.sub(pattern, '', content3)
print(result1, result2, result3)
# Output: 2016-12-15  2016-12-17  2016-12-22  (each result keeps the space that preceded the time)
The regular expression string is compiled once into the pattern object, and every later call reuses that object directly.
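A compiled pattern object also has its own sub(), search(), and findall() methods, so it does not have to be passed back into the module-level functions. A minimal sketch, with sample data and variable names assumed for illustration:

```python
import re

# Compile once, reuse many times
time_pattern = re.compile(r'\d{2}:\d{2}')
contents = ['2016-12-15 12:00', '2016-12-17 12:55']

# Call methods on the pattern object itself
dates_only = [time_pattern.sub('', c).strip() for c in contents]
times_only = [time_pattern.search(c).group() for c in contents]

# Modifiers can be supplied at compile time
demo_pattern = re.compile(r'demo', re.I)
found = demo_pattern.search('This is a Regex Demo') is not None

print(dates_only)  # ['2016-12-15', '2016-12-17']
print(times_only)  # ['12:00', '12:55']
print(found)       # True
```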