What is a regular expression?
A regular expression is a special sequence of characters that lets you check whether a string matches a pattern. Programs and web pages that process strings often need to find text that conforms to some complex rule, and regular expressions are a tool for describing such rules. In other words, a regular expression is code that records a rule for text.
Single character matching
Character | Function |
---|---|
. | Matches any single character except \n |
[] | Matches any one of the characters listed in [] |
\d | Matches a digit, that is, 0-9 |
\D | Matches a non-digit |
\s | Matches a whitespace character, such as a space or a tab |
\S | Matches a non-whitespace character |
\w | Matches a word character, that is, a-z, A-Z, 0-9, _ |
\W | Matches a non-word character |
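The character classes above can be tried out directly. The following is a minimal sketch; the sample string and variable names are made up for illustration.

```python
import re

sample = "Room 42\tis_open"

# \d matches a single digit; findall collects every non-overlapping match.
digits = re.findall(r"\d", sample)

# \s matches a single whitespace character (here one space and one tab).
blanks = re.findall(r"\s", sample)

# [] matches any one of the characters listed inside the brackets.
vowels = re.findall(r"[aeiou]", sample)

# . matches any single character except a newline.
dotted = re.match(r"R..m", sample).group()

# \W is the complement of \w: anything that is not a letter, digit, or _.
non_word = re.findall(r"\W", sample)

print(digits)    # ['4', '2']
print(blanks)    # [' ', '\t']
print(vowels)    # ['o', 'o', 'i', 'o', 'e']
print(dotted)    # Room
print(non_word)  # [' ', '\t']
```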
Multi-character matching
Character | Function |
---|---|
* | Matches the preceding character zero or more times, so the character is optional |
+ | Matches the preceding character one or more times, that is, at least once |
? | Matches the preceding character zero or one time, that is, either once or not at all |
{m} | Matches exactly m occurrences of the preceding character |
{m,n} | Matches from m to n occurrences of the preceding character |
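To see how the quantifiers differ, the sketch below checks a few candidate strings against each one. It uses re.fullmatch so that the whole string has to fit the pattern; the candidate strings are illustrative only.

```python
import re

candidates = ["a", "ab", "abb", "abbb"]
patterns = [r"ab*", r"ab+", r"ab?", r"ab{2}", r"ab{2,3}"]

# re.fullmatch succeeds only when the entire string fits the pattern,
# which makes the effect of each quantifier easy to compare.
accepted = {
    p: [s for s in candidates if re.fullmatch(p, s)]
    for p in patterns
}

for pattern, strings in accepted.items():
    print(pattern, strings)
# ab* ['a', 'ab', 'abb', 'abbb']
# ab+ ['ab', 'abb', 'abbb']
# ab? ['a', 'ab']
# ab{2} ['abb']
# ab{2,3} ['abb', 'abbb']
```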
Matching the beginning and end
Character | Function |
---|---|
^ | Matches the beginning of a string |
$ | Matches the end of a string |
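A short sketch of the two anchors follows. The sample text and the helper name is_all_digits are hypothetical and only serve the illustration.

```python
import re

text = "cat scatter cat"

# Without anchors, "cat" is found three times (once inside "scatter").
everywhere = re.findall(r"cat", text)

# ^ pins the match to the beginning of the string.
at_start = re.search(r"^cat", text).span()

# $ pins the match to the end of the string.
at_end = re.search(r"cat$", text).span()

def is_all_digits(value):
    """Return True when the string consists of digits from start to end."""
    return re.search(r"^\d+$", value) is not None

print(everywhere)              # ['cat', 'cat', 'cat']
print(at_start, at_end)        # (0, 3) (12, 15)
print(is_all_digits("12345"))  # True
print(is_all_digits("123a"))   # False
```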
Matching groups
Character | Function |
---|---|
| | Matches either the left or the right expression |
(ab) | Treats the characters in parentheses as a group |
\num | References the string matched by group number num |
(?P<name>) | Gives the group an alias (a named group) |
(?P=name) | References the string matched by the group with alias name |
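Groups and back-references are easier to follow with a concrete case. The sketch below matches simple paired tags; the tag strings and the group names tag and body are assumptions chosen for the example.

```python
import re

# | matches either the left or the right expression.
pet = re.fullmatch(r"cat|dog", "dog")

# \1 must repeat exactly what group 1 matched, so the tags have to agree.
paired = re.match(r"<(\w+)>.*</\1>", "<h1>title</h1>")
mismatched = re.match(r"<(\w+)>.*</\1>", "<h1>title</h2>")

# (?P<name>...) names a group; (?P=name) refers back to it.
named = re.match(r"<(?P<tag>\w+)>(?P<body>.*)</(?P=tag)>", "<p>hello</p>")

print(pet.group())          # dog
print(paired.group(1))      # h1
print(mismatched)           # None
print(named.group("tag"))   # p
print(named.group("body"))  # hello
```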
Common regular expressions
Content to match | Regular expression |
---|---|
Chinese characters | [\u4e00-\u9fa5] |
Double-byte characters | [^\x00-\xff] |
Blank line (use re.M on multi-line text) | ^\s*$ |
Email address | \w[-\w.+]*@([A-Za-z0-9][-A-Za-z0-9]+\.)+[A-Za-z]{2,14} |
URL | ^((https|http|ftp|rtsp|mms)?:\/\/)[^\s]+ |
Mobile phone number (domestic) | 0?(13|14|15|17|18)[0-9]{9} |
Telephone number (domestic) | [0-9-()（）]{7,16} |
Negative floating-point number | -([1-9]\d*\.\d*|0\.\d*[1-9]\d*) |
Integer | -?[1-9]\d* |
Positive floating-point number | [1-9]\d*\.\d*|0\.\d*[1-9]\d* |
Tencent QQ number | [1-9]([0-9]{5,11}) |
Zip code | \d{6} |
ID card number | \d{17}[\dXx]|\d{15} |
Date format | \d{4}(\-|\/|\.)\d{1,2}\1\d{1,2} |
Positive integer | [1-9]\d* |
Negative integer | -[1-9]\d* |
User name | [A-Za-z0-9_\-\u4e00-\u9fa5]+ |
IP address | (25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d) |
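As a rough sketch of how such patterns might be used for validation, the code below wraps a few of them in a helper. The dictionary PATTERNS and the function is_valid are hypothetical names, the patterns are written without the stray spaces, and re.fullmatch is used so that the entire input must match. Real-world validation usually needs stricter rules than these.

```python
import re

# A handful of the patterns from the table, keyed by an illustrative name.
PATTERNS = {
    "zip_code": r"\d{6}",
    "positive_integer": r"[1-9]\d*",
    "mobile": r"0?(13|14|15|17|18)[0-9]{9}",
    "email": r"\w[-\w.+]*@([A-Za-z0-9][-A-Za-z0-9]+\.)+[A-Za-z]{2,14}",
}

def is_valid(kind, text):
    """Return True when the whole text matches the named pattern."""
    return re.fullmatch(PATTERNS[kind], text) is not None

print(is_valid("zip_code", "100080"))              # True
print(is_valid("zip_code", "10008"))               # False
print(is_valid("positive_integer", "042"))         # False
print(is_valid("mobile", "13812345678"))           # True
print(is_valid("email", "user.name@example.com"))  # True
print(is_valid("email", "user@localhost"))         # False
```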
Python re module
When you need to match strings with regular expressions in Python, you can use the re module.
The name re is short for regular expression.
The re module gives the Python language full regular expression functionality.
The re.match function
re.match attempts to match a pattern at the start of the string; if the pattern does not match at the start position, match() returns None.
Function syntax
re.match(pattern, string, flags=0)
Function Parameter Description
Parameter | Description |
---|---|
pattern | The regular expression to match |
string | The string to match against |
flags | Flags that control how the regular expression is matched, such as case-insensitive matching or multi-line matching |
On success, re.match returns a match object; otherwise it returns None.
The group(num) and groups() methods of the match object return the matched text.
Match object method | Description |
---|---|
group(num=0) | Returns the string matched by the whole expression (group 0, the default) or by the given group. group() also accepts several group numbers at once, in which case it returns a tuple containing the values of those groups. |
groups() | Returns a tuple containing the strings of all groups, from group 1 up to the last group. |
span() | Returns the start and end positions of the match as a tuple (start, end) |
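The match object methods described above can be sketched as follows. The date string is an arbitrary example, not taken from the article.

```python
import re

m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2024-01-15 release")

whole = m.group()       # group 0: the entire match
year = m.group(1)       # a single group by number
pair = m.group(1, 3)    # several group numbers return a tuple
parts = m.groups()      # all groups, starting from group 1
where = m.span()        # (start, end) of the entire match
month_at = m.span(2)    # (start, end) of group 2

print(whole)     # 2024-01-15
print(year)      # 2024
print(pair)      # ('2024', '15')
print(parts)     # ('2024', '01', '15')
print(where)     # (0, 10)
print(month_at)  # (5, 7)
```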
Test sample
Example 1:
# -*- coding:utf-8 -*-
import re
print(re.match('www', 'www.csdn.net'))  # matches at the starting position
print(re.match('net', 'www.csdn.net'))  # does not match at the starting position
The output is as follows:
<re.Match object; span=(0, 3), match='www'>
None
span=(0, 3) gives the start position and end position of the successful match.
Example 2:
Extract the main statistics of an article from the following text:
Likes 75, comments 12, favorites 231
# -*- coding:utf-8 -*-
import re
line = "Likes 75, comments 12, favorites 231"
# re.I lets the lowercase "likes" in the pattern match "Likes" in the text
match_obj = re.match(r'likes (\d*), comments (\d*), favorites (\d*)', line, re.M | re.I)
if match_obj:
    print("match_obj.group():", match_obj.group())
    print("match_obj.group(1):", match_obj.group(1))
    print("match_obj.group(2):", match_obj.group(2))
    print("match_obj.group(3):", match_obj.group(3))
else:
    print("No match!!")
The output is as follows:
match_obj.group(): Likes 75, comments 12, favorites 231
match_obj.group(1): 75
match_obj.group(2): 12
match_obj.group(3): 231
re.I
Makes the match case-insensitive.
re.M
Multi-line matching; affects ^ and $.
More flags are covered later in this article.
The re.search function
re.search scans the entire string and returns the first successful match.
Function syntax
re.search(pattern, string, flags=0)
Function Parameter Description
Parameter | Description |
---|---|
pattern | The regular expression to match |
string | The string to search |
flags | Flags that control how the regular expression is matched, such as case-insensitive matching or multi-line matching |
As with re.match, re.search returns a match object on success; otherwise it returns None.
Test sample
Example 1:
# -*- coding:utf-8 -*-
import re
print(re.search('www', 'www.csdn.net'))  # matches at the starting position
print(re.search('net', 'www.csdn.net'))  # matches, although not at the starting position
The output of the above example is as follows:
<re.Match object; span=(0, 3), match='www'>
<re.Match object; span=(9, 12), match='net'>
Example 2:
Extract information about a blog post from the following text:
Author Clever_Hui, views 4155, favorites 231, category Python
# -*- coding:utf-8 -*-
import re
line = "Author Clever_Hui, views 4155, favorites 231, category Python"
# re.I lets the lowercase "author" in the pattern match "Author" in the text
search_obj = re.search(r'author (\w*), views (\d*), favorites (\d*), category (\w*)', line, re.M | re.I)
if search_obj:
    print("search_obj.group():", search_obj.group())
    print("search_obj.group(1):", search_obj.group(1))
    print("search_obj.group(2):", search_obj.group(2))
    print("search_obj.group(3):", search_obj.group(3))
    print("search_obj.group(4):", search_obj.group(4))
else:
    print("Nothing found!!!")
The output of the above example is as follows:
search_obj.group(): Author Clever_Hui, views 4155, favorites 231, category Python
search_obj.group(1): Clever_Hui
search_obj.group(2): 4155
search_obj.group(3): 231
search_obj.group(4): Python
Difference between re.match and re.search
re.match only matches at the beginning of the string; if the pattern does not match there, match() returns None.
re.search scans the entire string and returns the first successful match, or None if there is no match.
Test sample
# -*- coding:utf-8 -*-
import re
match_obj = re.match('net', 'www.csdn.net')
search_obj = re.search('net', 'www.csdn.net')
if match_obj:
    print('match -->', match_obj)
else:
    print('No match!!!')
if search_obj:
    print('search -->', search_obj)
else:
    print('No match!!!')
The output is as follows:
No match!!!
search --> <re.Match object; span=(9, 12), match='net'>
The re.findall function
Function syntax
re.findall(pattern, string, flags=0)
Function Parameter Description
Parameter | Description |
---|---|
pattern | The regular expression to match |
string | The string to search |
flags | Flags that control how the regular expression is matched, such as case-insensitive matching or multi-line matching |
The official documentation
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
In other words, re.findall returns all strings that match pattern as a list.
If pattern contains one or more capturing groups, a list of groups is returned;
if pattern has more than one group, this is a list of tuples.
Empty matches are included in the result.
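The effect of capturing groups on the return value can be sketched as follows. The input strings are illustrative; the last call shows how empty matches appear in the result.

```python
import re

line = "python = 9999, c = 7890"

# No group: each item is the full matched text.
no_group = re.findall(r"\w+ = \d+", line)

# One group: each item is the text of that group only.
one_group = re.findall(r"\w+ = (\d+)", line)

# Several groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(\w+) = (\d+)", line)

# a* can match the empty string, so empty matches show up as ''.
with_empty = re.findall(r"a*", "baaa")

print(no_group)    # ['python = 9999', 'c = 7890']
print(one_group)   # ['9999', '7890']
print(two_groups)  # [('python', '9999'), ('c', '7890')]
print(with_empty)  # ['', 'aaa', '']
```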
Test sample
Extract the numbers, and the name-value pairs, from the following text:
python = 9999, c = 7890, java = 12345
# -*- coding:utf-8 -*-
import re
line = "python = 9999, c = 7890, java = 12345"
ret1 = re.findall(r"\d+", line)
ret2 = re.findall(r"(\w+)\s.\s(\d+)", line)
print(ret1)
print(ret2)
Running results:
['9999', '7890', '12345']
[('python', '9999'), ('c', '7890'), ('java', '12345')]
The re.sub function
Python's re module provides re.sub to replace matches in strings.
Function syntax
re.sub(pattern, repl, string, count=0, flags=0)
The official documentation
sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl. repl can be either a string or a callable;
if a string, backslash escapes in it are processed. If it is
a callable, it's passed the Match object and must return
a replacement string to be used.
Function Parameter Description
Parameter | Description |
---|---|
pattern | The regular expression to match |
repl | The replacement: either a string or a function |
string | The string to search |
count | Maximum number of matches to replace |
flags | Flags that control how the regular expression is matched, such as case-insensitive matching or multi-line matching |
The returned string is obtained by replacing the leftmost non-overlapping matches of the pattern in the string. If the pattern is not found, the string is returned unchanged.
The optional parameter count is the maximum number of matches to replace. It must be a non-negative integer. The default value 0 means that all matches are replaced.
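A minimal sketch of the count parameter and of the unchanged-on-no-match behaviour follows. The input string is made up, and the \1 back-reference in the replacement string is shown as one example of the backslash escapes that repl processes.

```python
import re

text = "a1b22c333"

# Default count=0 replaces every match.
all_replaced = re.sub(r"\d+", "#", text)

# count=2 replaces only the two leftmost matches.
first_two = re.sub(r"\d+", "#", text, count=2)

# \1 in the replacement refers to the text captured by group 1.
wrapped = re.sub(r"(\d+)", r"<\1>", text)

# When the pattern is not found, the string comes back unchanged.
untouched = re.sub(r"x", "#", text)

print(all_replaced)  # a#b#c#
print(first_two)     # a#b#c333
print(wrapped)       # a<1>b<22>c<333>
print(untouched)     # a1b22c333
```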
Test case
Requirement: add 1 to the matched view count
Method 1:
# -*- coding:utf-8 -*-
import re
ret = re.sub(r"\d+", '998', "python = 997")
print(ret)
Running results:
python = 998
Method 2:
# -*- coding:utf-8 -*-
import re
def add(temp):
    # temp is the match object; return the matched number plus 1 as a string
    str_num = temp.group()
    num = int(str_num) + 1
    return str(num)
ret = re.sub(r"\d+", add, "python = 997")
print(ret)
ret = re.sub(r"\d+", add, "python = 99")
print(ret)
Running results:
python = 998
python = 100
Regular expression modifiers: optional flags
Regular expression functions accept optional flags that control how matching is performed. Multiple flags can be combined with the bitwise OR operator (|). For example, re.I | re.M sets both the I and M flags:
Modifier | Description |
---|---|
re.I | Makes the match case-insensitive |
re.L | Performs locale-aware matching |
re.M | Multi-line matching; affects ^ and $ |
re.S | Makes . match any character, including newlines |
re.U | Interprets characters according to the Unicode character set. This flag affects \w, \W, \b, \B. |
re.X | Verbose mode: lets you lay out the pattern more flexibly, with whitespace and comments, so that it is easier to understand. |
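The flags can be compared side by side in a short sketch. The three-line sample text and the variable names are assumptions for illustration.

```python
import re

text = "Python\npython\nPYTHON"

# Without flags the match is case-sensitive.
plain = re.findall(r"python", text)

# re.I ignores case.
ignore_case = re.findall(r"python", text, re.I)

# Without re.M, ^ and $ only anchor at the ends of the whole string.
whole_only = re.findall(r"^python$", text, re.I)

# re.I | re.M combines both flags: ^ and $ now anchor at every line.
per_line = re.findall(r"^python$", text, re.I | re.M)

# Without re.S, . stops at the first newline; with re.S it crosses lines.
first_line = re.match(r".+", text).group()
everything = re.match(r".+", text, re.S).group()

# re.X allows whitespace and comments inside the pattern.
year_month = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", re.X)

print(plain)        # ['python']
print(ignore_case)  # ['Python', 'python', 'PYTHON']
print(whole_only)   # []
print(per_line)     # ['Python', 'python', 'PYTHON']
print(first_line)   # Python
print(year_month.match("2024-01").groups())  # ('2024', '01')
```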
Greed and non-greed
Quantifiers in Python are greedy by default (a few languages may default to non-greedy): they always try to match as many characters as possible.
Non-greedy quantifiers, on the other hand, always try to match as few characters as possible.
Appending ? to *, ?, +, or {m,n} turns a greedy quantifier into a non-greedy one.
# -*- coding:utf-8 -*-
import re
s = "This is a number 234-235-22-423"
r = re.match(r".+(\d+-\d+-\d+-\d+)", s)
print(r.group(1))
# prints 4-235-22-423
r = re.match(r".+?(\d+-\d+-\d+-\d+)", s)
print(r.group(1))
# prints 234-235-22-423
A regular expression is evaluated from left to right, and a greedy quantifier tries to grab the longest string that still lets the whole pattern match. In the example above, .+ grabs as many characters as it can from the start of the string, including most of the first integer field that we want. \d+ needs only one character to match, so it matches the digit 4, while .+ matches everything from the beginning of the string up to that 4.
Workaround: use the non-greedy operator ?. It can follow *, ?, +, or {m,n} and makes the quantifier match as few characters as possible.
>>> re.match(r"aa(\d+)", "aa2343ddd").group(1)
'2343'
>>> re.match(r"aa(\d+?)", "aa2343ddd").group(1)
'2'
>>> re.match(r"aa(\d+)ddd", "aa2343ddd").group(1)
'2343'
>>> re.match(r"aa(\d+?)ddd", "aa2343ddd").group(1)
'2343'
>>>
The role of raw strings (the r prefix)
>>> mm = "c:\\a\\b\\c"
>>> mm
'c:\\a\\b\\c'
>>> print(mm)
c:\a\b\c
>>> re.match("c:\\\\",mm).group()
'c:\\'
>>> ret = re.match("c:\\\\",mm).group()
>>> print(ret)
c:\
>>> ret = re.match("c:\\\\a",mm).group()
>>> print(ret)
c:\a
>>> ret = re.match(r"c:\\a",mm).group()
>>> print(ret)
c:\a
>>> ret = re.match(r"c:\a",mm).group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>>
Explanation
In Python, a string prefixed with r is a raw string.
As in most programming languages, regular expressions use "\" as the escape character, which can cause a backslash plague. To match a literal \ in text, a pattern written as an ordinary string needs four backslashes: the string literal first turns each pair into one backslash, and the regular expression engine then turns the resulting two backslashes into a single literal backslash.
Raw strings in Python solve this problem nicely. With a raw string, you do not have to worry about a missing backslash, and the expressions you write are much more intuitive.
>>> ret = re.match(r"c:\\a",mm).group()
>>> print(ret)
c:\a
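The two levels of escaping can be made visible by comparing an ordinary literal with a raw literal. This is a small sketch; the variable names are illustrative, and re.escape is shown as one possible way to build a pattern from plain text.

```python
import re

path = "c:\\a\\b\\c"   # the text c:\a\b\c

ordinary = "c:\\\\a"   # ordinary literal: four backslashes in the source
raw = r"c:\\a"         # raw literal: two backslashes in the source

# Both literals produce the same five-character pattern: c : \ \ a
same_pattern = (ordinary == raw)
pattern_length = len(raw)

# The regex engine reads \\ as one literal backslash, so the match is c:\a
matched = re.match(raw, path).group()

# re.escape builds an escaped pattern from plain text.
prefix = "c:\\a"
escaped_match = re.match(re.escape(prefix), path).group()

print(same_pattern, pattern_length)  # True 5
print(matched)                       # c:\a
print(escaped_match == prefix)       # True
```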
Comprehensive examples
Check whether a variable name is valid
# -*- coding:utf-8 -*-
import re
names = ["name1", "_name", "2_name", "__name__"]
for name in names:
    ret = re.match(r"[a-zA-Z_]+[\w]*", name)
    if ret:
        print("Variable name %s is valid" % ret.group())
    else:
        print("Invalid variable name %s" % name)
The results
Variable name name1 is valid
Variable name _name is valid
Invalid variable name 2_name
Variable name __name__ is valid
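One caveat about this check: re.match only anchors at the start, so a name with invalid trailing characters still produces a match on its valid prefix. A possible stricter variant, sketched here with the hypothetical names IDENTIFIER and is_identifier, uses fullmatch so that the whole name must fit.

```python
import re

# A letter or underscore first, then any number of word characters.
IDENTIFIER = re.compile(r"[a-zA-Z_]\w*")

def is_identifier(name):
    """Return True only when the entire name fits the pattern."""
    return IDENTIFIER.fullmatch(name) is not None

# re.match accepts the valid prefix "name" of an invalid name.
prefix_only = re.match(r"[a-zA-Z_]+[\w]*", "name-1").group()

print(prefix_only)               # name
print(is_identifier("name-1"))   # False
print(is_identifier("name1"))    # True
print(is_identifier("2_name"))   # False
print(is_identifier("__name__")) # True
```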
Match a password of 8 to 15 characters
The password may contain letters or digits, but must start with an uppercase letter
# -*- coding:utf-8 -*-
import re
pwds = ["W123456W", "wwj123456", "W123456789", "w123w", "W12345678abcdefg"]
# One uppercase letter followed by 7 to 14 letters or digits
pwd_pattern_str = r"^[A-Z][a-zA-Z0-9]{7,14}$"
for pwd in pwds:
    ret = re.match(pwd_pattern_str, pwd)
    if ret:
        pwd = ret.group()
        print("Password %s meets requirements, length: %s" % (pwd, len(pwd)))
    else:
        print("Password %s is invalid" % pwd)
The running results are as follows:
Password W123456W meets requirements, length: 8
Password wwj123456 is invalid
Password W123456789 meets requirements, length: 10
Password w123w is invalid
Password W12345678abcdefg is invalid
Regular expression online tool
A regular expression online tool is available at https://www.w3cschool.cn/tools/index?name=create_reg
This online tool generates common regular expressions, such as those for characters, URLs, postcodes, dates, and Chinese text. It also provides regular expression test statements in various common languages, such as JavaScript, PHP, Go, Java, Ruby, and Python, for your reference.