What is a regular expression?

A regular expression is a special sequence of characters that describes a text pattern and lets you check whether a string matches it. Programs and web pages that process strings often need to find text that conforms to fairly complex rules, and regular expressions are a tool for describing those rules. In other words, a regular expression is code that records a rule for text.


Single character matching

Character   Function
.           Matches any single character except \n
[]          Matches any one of the characters listed inside the brackets
\d          Matches a digit, that is, 0-9
\D          Matches a non-digit
\s          Matches whitespace, such as a space or a tab
\S          Matches non-whitespace
\w          Matches a word character, that is, a-z, A-Z, 0-9, _
\W          Matches a non-word character
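
As a quick illustration of the character classes above, here is a minimal sketch using re.findall and re.match (both are introduced later in this article). The sample string is made up for the demonstration.

```python
import re

sample = "Room 42, floor_3"  # made-up sample text

digits = re.findall(r"\d", sample)            # every single digit
spaces = re.findall(r"\s", sample)            # every whitespace character
vowels = re.findall(r"[aeiou]", sample)       # any one of the listed characters
non_word = re.findall(r"\W", sample)          # everything that is not a word character
first_three = re.match(r"...", sample).group()  # any three characters from the start

print(digits)       # ['4', '2', '3']
print(spaces)       # [' ', ' ']
print(vowels)       # ['o', 'o', 'o', 'o']
print(non_word)     # [' ', ',', ' ']
print(first_three)  # Roo
```

Note that the underscore in floor_3 counts as a word character, so \W does not report it.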


Multi-character matching

Character   Function
*           Matches the preceding character zero or more times, so it is optional
+           Matches the preceding character one or more times, that is, at least once
?           Matches the preceding character zero or one time, that is, either once or not at all
{m}         Matches exactly m occurrences of the preceding character
{m,n}       Matches from m to n occurrences of the preceding character
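
The quantifiers can be compared side by side with a small sketch. It uses re.fullmatch, which requires the whole string to match. The candidate strings and the helper name matching are made up for illustration.

```python
import re

candidates = ["ct", "cat", "caat", "caaat"]  # made-up test strings

def matching(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = matching(r"ca*t")             # zero or more 'a'
plus = matching(r"ca+t")             # one or more 'a'
optional = matching(r"ca?t")         # zero or one 'a'
exactly_two = matching(r"ca{2}t")    # exactly two 'a'
two_to_three = matching(r"ca{2,3}t")  # two or three 'a'

print(star)          # ['ct', 'cat', 'caat', 'caaat']
print(plus)          # ['cat', 'caat', 'caaat']
print(optional)      # ['ct', 'cat']
print(exactly_two)   # ['caat']
print(two_to_three)  # ['caat', 'caaat']
```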


Matching the beginning and end

Character   Function
^           Matches the beginning of a string
$           Matches the end of a string
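
A short sketch of how the two anchors behave, using re.search so that only the anchors decide where a match may occur. The six-digit check is an illustrative use, not something prescribed by this article.

```python
import re

text = "www.csdn.net"

starts_with_www = bool(re.search(r"^www", text))  # 'www' is at the beginning
ends_with_net = bool(re.search(r"net$", text))    # 'net' is at the end
ends_with_www = bool(re.search(r"www$", text))    # 'www' is not at the end

# Using both anchors forces the pattern to cover the entire string
six_digits = bool(re.search(r"^\d{6}$", "100086"))
seven_digits = bool(re.search(r"^\d{6}$", "1000867"))

print(starts_with_www, ends_with_net, ends_with_www)  # True True False
print(six_digits, seven_digits)                        # True False
```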


Matching groups

Character     Function
|             Matches either the expression on its left or the one on its right
(ab)          Treats the characters in parentheses as a group
\num          References the string matched by group number num
(?P<name>)    Gives a group an alias (a named group)
(?P=name)     References the string matched by the group with the alias name
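
The grouping constructs are easier to follow with a concrete case. The sketch below uses a made-up file name and made-up HTML-like tags. It shows alternation inside a group, a numbered back-reference, and a named group with a named back-reference.

```python
import re

# | chooses between alternatives; the parentheses limit its scope
ext = re.search(r"\.(jpg|png)$", "photo.png").group(1)

# \1 must repeat exactly what group 1 matched, so the closing tag has to agree
tag_pattern = r"<(\w+)>.*</\1>"
tag_ok = re.match(tag_pattern, "<h1>title</h1>")
tag_bad = re.match(tag_pattern, "<h1>title</h2>")

# The same idea with an alias: (?P<tag>...) names the group, (?P=tag) refers back to it
named = re.match(r"<(?P<tag>\w+)>(?P<body>.*)</(?P=tag)>", "<h1>title</h1>")

print(ext)                  # png
print(tag_ok is not None)   # True
print(tag_bad is not None)  # False
print(named.group("tag"), named.group("body"))  # h1 title
```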


Common regular expressions

Content to match                  Regular expression
Chinese characters                [\u4e00-\u9fa5]
Double-byte characters            [^\x00-\xff]
Whitespace character              \s
Email address                     \w[-\w.+]*@([A-Za-z0-9][-A-Za-z0-9]+\.)+[A-Za-z]{2,14}
URL                               ^((https|http|ftp|rtsp|mms)?:\/\/)[^\s]+
Mobile phone number (domestic)    0?(13|14|15|17|18)[0-9]{9}
Telephone number (domestic)       [0-9-()（）]{7,16}
Negative floating-point number    -([1-9]\d*\.\d*|0\.\d*[1-9]\d*)
Integer                           -?[1-9]\d*
Positive floating-point number    [1-9]\d*\.\d*|0\.\d*[1-9]\d*
Tencent QQ number                 [1-9]([0-9]{5,11})
Zip code                          \d{6}
ID card number                    \d{17}[\dXx]|\d{15}
Date format                       \d{4}(-|\/|\.)\d{1,2}\1\d{1,2}
Positive integer                  [1-9]\d*
Negative integer                  -[1-9]\d*
User name                         [A-Za-z0-9_\-\u4e00-\u9fa5]+
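
One way to use such patterns for validation is sketched below. The three patterns are taken from the table above with spacing normalized. The dictionary, the helper name is_valid, and the sample inputs are assumptions made for the example. re.fullmatch is used so that the whole input has to match, not just a prefix.

```python
import re

# Patterns taken from the table above; the sample inputs are made up
PATTERNS = {
    "zip code": r"\d{6}",
    "QQ number": r"[1-9]([0-9]{5,11})",
    "email": r"\w[-\w.+]*@([A-Za-z0-9][-A-Za-z0-9]+\.)+[A-Za-z]{2,14}",
}

def is_valid(kind, text):
    """Return True if the whole of text matches the pattern registered for kind."""
    return re.fullmatch(PATTERNS[kind], text) is not None

zip_ok = is_valid("zip code", "100086")
zip_short = is_valid("zip code", "10008")
qq_ok = is_valid("QQ number", "12345678")
email_ok = is_valid("email", "user.name+tag@example.com")
email_bad = is_valid("email", "not-an-email")

print(zip_ok, zip_short)    # True False
print(qq_ok)                # True
print(email_ok, email_bad)  # True False
```

These patterns are deliberately loose. They are convenient for quick checks but should not be treated as complete validators.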


IP address

Regular expression

(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)
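
Because the same sub-pattern is repeated four times, the expression can be assembled from a single piece. The sketch below is one possible way to do that. The names octet, ip_pattern, and is_ipv4 are made up for the example, and fullmatch is used so that nothing may precede or follow the address.

```python
import re

# One number from 0 to 255, written exactly as in the expression above
octet = r"(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)"

# Four octets joined by literal dots
ip_pattern = re.compile(r"\.".join([octet] * 4))

def is_ipv4(text):
    """Return True if text consists of exactly one dotted IPv4 address."""
    return ip_pattern.fullmatch(text) is not None

print(is_ipv4("192.168.1.1"))  # True
print(is_ipv4("10.0.0.255"))   # True
print(is_ipv4("256.1.1.1"))    # False
print(is_ipv4("1.2.3"))        # False
```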


Python re module

To match strings against regular expressions in Python, use the built-in re module.

The name re is short for regular expression.

The re module gives the Python language full regular expression functionality.


re.match function

re.match tries to match a pattern at the start of the string. If the pattern does not match at the start position, match() returns None.

Function syntax

re.match(pattern, string, flags=0)


Function parameter description

Parameter   Description
pattern     The regular expression to match
string      The string to match against
flags       Flags that control how the regular expression is matched, such as case-insensitive or multi-line matching

re.match returns a match object on success and None otherwise.

Use the match object's group(num) or groups() methods to retrieve the matched text.

Match object method   Description
group(num=0)          Returns the string matched by the whole expression (num=0, the default) or by group num. group() also accepts several group numbers at once, in which case it returns a tuple with the values of those groups.
groups()              Returns a tuple containing the strings of all groups, from group 1 up to the last group in the pattern.
span()                Returns the start and end positions of the match as a tuple (start, end).
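
The match object methods listed above can be tried on a small pattern with two groups. The address below is a made-up sample.

```python
import re

m = re.match(r"(\w+)@(\w+)\.com", "alice@example.com")  # made-up sample address

whole = m.group()       # the entire match (same as group(0))
user = m.group(1)       # text captured by the first group
both = m.group(1, 2)    # several group numbers at once give a tuple
all_groups = m.groups() # every group, starting from group 1
position = m.span()     # (start, end) of the entire match
host_position = m.span(2)  # span() also accepts a group number

print(whole)          # alice@example.com
print(user)           # alice
print(both)           # ('alice', 'example')
print(all_groups)     # ('alice', 'example')
print(position)       # (0, 17)
print(host_position)  # (6, 13)
```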


Test sample

Example 1:

# -*- coding:utf-8 -*-
import re

print(re.match('www', 'www.csdn.net'))  # pattern occurs at the start: returns a match object
print(re.match('net', 'www.csdn.net'))  # pattern does not occur at the start: returns None

The output is as follows:

<re.Match object; span=(0, 3), match='www'>
None

span=(0, 3) gives the start and end positions of the successful match.


Example 2:

Extract the main statistics of an article

Likes 75, comments 12, favorites 231

# -*- coding:utf-8 -*-
import re

line = "Likes 75, comments 12, favorites 231"

# Each (\d*) group captures one of the three counters
match_obj = re.match(r'likes (\d*), comments (\d*), favorites (\d*)', line, re.M | re.I)

if match_obj:
    print("match_obj.group() : ", match_obj.group())
    print("match_obj.group(1) : ", match_obj.group(1))
    print("match_obj.group(2) : ", match_obj.group(2))
    print("match_obj.group(3) : ", match_obj.group(3))
else:
    print("No match!!")

The output is as follows:

match_obj.group() :  Likes 75, comments 12, favorites 231
match_obj.group(1) :  75
match_obj.group(2) :  12
match_obj.group(3) :  231

  • re.I makes the match case insensitive
  • re.M enables multi-line matching, which affects ^ and $

More flags are covered later in this article.


re.search method

re.search scans the entire string and returns the first successful match.

Function syntax

re.search(pattern, string, flags=0)

Function parameter description

Parameter   Description
pattern     The regular expression to match
string      The string to match against
flags       Flags that control how the regular expression is matched, such as case-insensitive or multi-line matching

Like re.match, re.search returns a match object on success and None otherwise.


Test sample

Example 1:

# -*- coding:utf-8 -*-
import re

print(re.search('www', 'www.csdn.net'))  # found at the starting position
print(re.search('net', 'www.csdn.net'))  # found later in the string, not at the start

The output of the above example is as follows:

<re.Match object; span=(0, 3), match='www'>
<re.Match object; span=(9, 12), match='net'>


Example 2:

Extract information from a blog post

Author Clever_Hui, reads 4155, favorites 231, category Python

# -*- coding:utf-8 -*-
import re

line = "Author Clever_Hui, reads 4155, favorites 231, category Python"

# Four groups: author name, read count, favorite count, category
search_obj = re.search(r'author (\w*), reads (\d*), favorites (\d*), category (\w*)', line, re.M | re.I)

if search_obj:
    print("search_obj.group() : ", search_obj.group())
    print("search_obj.group(1) : ", search_obj.group(1))
    print("search_obj.group(2) : ", search_obj.group(2))
    print("search_obj.group(3) : ", search_obj.group(3))
    print("search_obj.group(4) : ", search_obj.group(4))
else:
    print("Nothing found!!!")

The execution results of the above example are as follows:

search_obj.group() :  Author Clever_Hui, reads 4155, favorites 231, category Python
search_obj.group(1) :  Clever_Hui
search_obj.group(2) :  4155
search_obj.group(3) :  231
search_obj.group(4) :  Python


Difference between re.match and re.search

re.match only tries to match the pattern at the start of the string. If the pattern does not match there, match() returns None.

re.search scans the entire string and returns the first successful match, or None if there is no match anywhere.

Test sample

# -*- coding:utf-8 -*-
import re

match_obj = re.match('net', 'www.csdn.net')
search_obj = re.search('net', 'www.csdn.net')

if match_obj:
    print('match --> ', match_obj)
else:
    print('No match!!!')

if search_obj:
    print('search --> ', search_obj)
else:
    print('No match!!!')

The output is as follows:

No match!!!
search -->  <re.Match object; span=(9, 12), match='net'>


re.findall function

Function syntax

re.findall(pattern, string, flags=0)


Function parameter description

Parameter   Description
pattern     The regular expression to match
string      The string to match against
flags       Flags that control how the regular expression is matched, such as case-insensitive or multi-line matching

The official documentation

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result.

In other words, re.findall returns every non-overlapping match of pattern in the string as a list.

If pattern contains one capture group, the list holds the text of that group for each match.

If pattern contains more than one group, the list holds tuples with one entry per group.

Empty matches are included in the result.


Test sample

python = 9999, c = 7890, java = 12345

# -*- coding:utf-8 -*-
import re

line = "python = 9999, c = 7890, java = 12345"
ret1 = re.findall(r"\d+", line)               # no groups: list of whole matches
ret2 = re.findall(r"(\w+)\s.\s(\d+)", line)   # two groups: list of tuples

print(ret1)
print(ret2)

Running results:

['9999', '7890', '12345']
[('python', '9999'), ('c', '7890'), ('java', '12345')]


re.sub function

Python's re module provides re.sub to replace matches in a string.

Function syntax

re.sub(pattern, repl, string, count=0, flags=0)

The official documentation


sub(pattern, repl, string, count=0, flags=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the Match object and must return
    a replacement string to be used.


Function parameter description

Parameter   Description
pattern     The regular expression to match
repl        The replacement: either a string or a function
string      The string to search and replace in
count       The maximum number of matches to replace
flags       Flags that control how the regular expression is matched, such as case-insensitive or multi-line matching

re.sub returns the string obtained by replacing the leftmost non-overlapping matches of the pattern in string. If the pattern is not found, the string is returned unchanged.

The optional parameter count is the maximum number of matches to replace. count must be a non-negative integer. The default value 0 replaces all matches.
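
The effect of count, and of backslash escapes in a replacement string, can be seen in a short sketch. The input strings are made up for the demonstration.

```python
import re

text = "a1b22c333"  # made-up input

all_replaced = re.sub(r"\d+", "#", text)         # count=0 (default): replace every match
first_only = re.sub(r"\d+", "#", text, count=1)  # replace only the leftmost match

# In a replacement string, \1 and \2 insert the text captured by groups 1 and 2
swapped = re.sub(r"(\w+) = (\d+)", r"\2 = \1", "python = 997")

print(all_replaced)  # a#b#c#
print(first_only)    # a#b22c333
print(swapped)       # 997 = python
```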


Test sample

Requirement: add 1 to the matched read count

Method 1:

# -*- coding:utf-8 -*-
import re

# Replace the number with a fixed string
ret = re.sub(r"\d+", '998', "python = 997")
print(ret)

Running results:

python = 998


Method 2:

# -*- coding:utf-8 -*-
import re

def add(temp):
    # temp is the match object for the current match
    str_num = temp.group()
    num = int(str_num) + 1
    return str(num)

ret = re.sub(r"\d+", add, "python = 997")
print(ret)

ret = re.sub(r"\d+", add, "python = 99")
print(ret)

Running results:

python = 998
python = 100


Regular expression modifiers: optional flags

Regular expressions can take optional flag modifiers that control how matching is performed. Multiple flags can be combined with the bitwise OR operator (|). For example, re.I | re.M sets both the I and M flags:

Modifier   Description
re.I       Makes the match case insensitive
re.L       Makes matching locale-aware
re.M       Multi-line matching, which affects ^ and $
re.S       Makes . match every character, including newlines
re.U       Interprets characters according to the Unicode character set. This flag affects \w, \W, \b, and \B.
re.X       Allows a more flexible layout (whitespace and comments inside the pattern) so that the regular expression is easier to read.
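
The following sketch shows what several of these flags change in practice. The three-line sample text and the year-month pattern are made up for the example.

```python
import re

text = "Python\npython\nPYTHON"  # made-up three-line sample

case_sensitive = re.findall(r"python", text)
ignore_case = re.findall(r"python", text, re.I)

# Without re.M, ^ only matches at the very start of the string
start_only = re.findall(r"^p\w+", text, re.I)
every_line = re.findall(r"^p\w+", text, re.I | re.M)

# Without re.S, . stops at the first newline
dot_default = re.match(r".+", text).group()
dot_all = re.match(r".+", text, re.S).group()

# re.X ignores whitespace in the pattern and allows comments
year_month = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", re.X)
parts = year_month.match("2024-05").groups()

print(case_sensitive)  # ['python']
print(ignore_case)     # ['Python', 'python', 'PYTHON']
print(start_only)      # ['Python']
print(every_line)      # ['Python', 'python', 'PYTHON']
print(dot_default)     # Python
print(repr(dot_all))   # 'Python\npython\nPYTHON'
print(parts)           # ('2024', '05')
```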


Greedy and non-greedy matching

Quantifiers in Python are greedy by default (a few languages may default to non-greedy): they always try to match as many characters as possible.

Non-greedy quantifiers, on the other hand, always try to match as few characters as possible.

Adding ? after *, ?, +, or {m,n} turns a greedy quantifier into a non-greedy one.

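# -*- coding:utf-8 -*-
import re

s = "This is a number 234-235-22-423"

# Greedy: .+ consumes as much as it can and leaves only "4" for the first \d+
r = re.match(r".+(\d+-\d+-\d+-\d+)", s)
print(r.group(1))
# '4-235-22-423'

# Non-greedy: .+? stops as early as possible, so the group gets the whole number
r = re.match(r".+?(\d+-\d+-\d+-\d+)", s)
print(r.group(1))
# '234-235-22-423'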

A greedy quantifier, working from left to right through the pattern, tries to grab the longest string that still lets the whole pattern match. In the example above, .+ grabs as many characters as it can from the start of the string, including most of the first integer field that we wanted to capture. \d+ needs only one character to match, so it is left with the digit 4, while .+ matches everything from the beginning of the string up to that 4.

Workaround: use the non-greedy operator ?. It can follow *, ?, +, or {m,n} and makes the quantifier match as little as possible.

>>> re.match(r"aa(\d+)", "aa2343ddd").group(1)
'2343'
>>> re.match(r"aa(\d+?)", "aa2343ddd").group(1)
'2'
>>> re.match(r"aa(\d+)ddd", "aa2343ddd").group(1)
'2343'
>>> re.match(r"aa(\d+?)ddd", "aa2343ddd").group(1)
'2343'


Purpose of raw strings (the r prefix)

>>> mm = "c:\\a\\b\\c"
>>> mm
'c:\\a\\b\\c'
>>> print(mm)
c:\a\b\c
    
>>> re.match("c:\\\\",mm).group()
'c:\\'

>>> ret = re.match("c:\\\\",mm).group()
>>> print(ret)
c:\
    
>>> ret = re.match("c:\\\\a",mm).group()
>>> print(ret)
c:\a
    
>>> ret = re.match(r"c:\\a",mm).group()
>>> print(ret)
c:\a
    
>>> ret = re.match(r"c:\a",mm).group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

Explanation

In Python, a string literal prefixed with r is a raw string.

As in most programming languages, regular expressions use \ as the escape character, which can lead to a pile-up of backslashes. To match a literal \ in the text, a pattern written as an ordinary string literal needs four backslashes: the programming language first turns each pair into a single backslash, and the regular expression engine then interprets the resulting two backslashes as one literal backslash.

Raw strings in Python solve this problem nicely. With a raw string you do not have to worry about missing a backslash, and the expression you write is much easier to read.

>>> ret = re.match(r"c:\\a",mm).group()
>>> print(ret)
c:\a


Comprehensive examples

Check whether a variable name is valid

# -*- coding:utf-8 -*-
import re

names = ["name1", "_name", "2_name", "__name__"]

for name in names:
    # A letter or underscore first, then any number of word characters
    ret = re.match(r"[a-zA-Z_]+[\w]*", name)
    if ret:
        print("Variable name %s is valid" % ret.group())
    else:
        print("Invalid variable name %s" % name)

The results

Variable name name1 is valid
Variable name _name is valid
Invalid variable name 2_name
Variable name __name__ is valid
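
One caveat: re.match only anchors the start of the string, so a name such as na-me would pass the check above because the prefix na matches. A stricter variant, sketched here with re.fullmatch and a made-up helper name is_valid_name, requires the whole name to match. It is an alternative to the code above, not a restatement of it.

```python
import re

# A letter or underscore, followed by any number of word characters
identifier = re.compile(r"[a-zA-Z_]\w*")

def is_valid_name(name):
    """Return True only if the entire name matches the identifier pattern."""
    return identifier.fullmatch(name) is not None

prefix_only = re.match(r"[a-zA-Z_]+[\w]*", "na-me")  # matches just the prefix 'na'

print(prefix_only.group())      # na
print(is_valid_name("name1"))   # True
print(is_valid_name("2_name"))  # False
print(is_valid_name("na-me"))   # False
```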


Match a password of 8 to 15 characters

The password may contain uppercase letters, lowercase letters, and digits, but must start with an uppercase letter

# -*- coding:utf-8 -*-
import re

pwds = ["W123456W", "wwj123456", "W123456789", "w123w", "W12345678abcdefg"]
# One uppercase letter followed by 7 to 14 letters or digits: 8 to 15 characters in total
pwd_pattern_str = r"^[A-Z][a-zA-Z0-9]{7,14}$"

for pwd in pwds:
    ret = re.match(pwd_pattern_str, pwd)
    if ret:
        pwd = ret.group()
        print("Password %s meets requirements, length: %s" % (pwd, len(pwd)))
    else:
        print("Password %s is invalid" % pwd)


The running results are as follows:

Password W123456W meets requirements, length: 8
Password wwj123456 is invalid
Password W123456789 meets requirements, length: 10
Password w123w is invalid
Password W12345678abcdefg is invalid


Regular expression online tool

An online regular expression tool is available at https://www.w3cschool.cn/tools/index?name=create_reg

The tool generates common regular expressions online, for example for characters, URLs, postcodes, dates, and Chinese text. It also provides regular expression test statements in several common languages, including JavaScript, PHP, Go, Java, Ruby, and Python, for reference.

