Regular expressions (RE)
- It’s a computer science concept
- A single pattern string describes a rule and is used to match the strings that follow that rule
- Often used to search for and replace text that fits a certain pattern
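Regular expression notation
- . (dot): matches any single character except \n
- []: matches any one of the characters listed inside the brackets; for example, [LY0] matches each L, Y, or 0 in LLY, Y0, and LIU (a comma inside the brackets is not a separator and would itself be matched)
- \d: any digit
- \D: any character that is not a digit
- \s: whitespace, such as a space or a tab
- \S: any character that is not whitespace
- \w: word characters, namely a-z, A-Z, 0-9, and _ (for str patterns in Python 3 it also matches Unicode letters and digits unless re.ASCII is set)
- \W: anything that \w does not match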
- *: the preceding item is repeated zero or more times, for example \w*
- +: the preceding item appears at least once
- ?: the preceding item appears zero times or once
- {m,n}: the preceding item appears at least m times and at most n times
- ^: matches the beginning of the string
- $: matches the end of the string
- \b: matches a word boundary
- (): groups part of the regular expression; groups are numbered from 1 in the order their opening parentheses appear
Some example patterns:
- Exactly one digit: ^\d$
- Digits only, at least one digit: ^\d+$
- Digits only, 5 to 10 digits long: ^\d{5,10}$
- Registration age from 16 to 99: ^(1[6-9]|[2-9]\d)$ (a character class such as [16,99] matches only a single character, so it cannot express a numeric range)
- English letters and digits only: ^[a-zA-Z0-9]+$
- QQ number check: ^[0-9]{5,12}$
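- \A: matches only at the beginning of the string; \Aabcd matches abcd at the start
- \Z: matches only at the end of the string; abcd\Z matches abcd at the end
- |: alternation; matches either the expression on its left or the one on its right
- (?P<name>…): a named group; it gets an alias in addition to its usual number, for example (?P<name>12345){2} matches 1234512345
- (?P=name): refers back to the text matched by the group called name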
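The validation patterns and the named-group syntax above can be tried out with a short sketch. It assumes Python's re module, which the rest of this article uses. The names `checks` and `is_valid` and the sample inputs are made up for illustration, and the age pattern is just one possible way to express 16 to 99, since a character class cannot describe a numeric range.

```python
import re

# Hypothetical sample patterns; the keys and inputs are illustrative only.
checks = {
    "digits_only": r"^\d+$",
    "five_to_ten_digits": r"^\d{5,10}$",
    "age_16_to_99": r"^(1[6-9]|[2-9]\d)$",
    "letters_and_digits": r"^[a-zA-Z0-9]+$",
}


def is_valid(kind, text):
    """Return True if text matches the pattern registered under kind."""
    return re.search(checks[kind], text) is not None


print(is_valid("digits_only", "12345"))         # True
print(is_valid("five_to_ten_digits", "1234"))   # False
print(is_valid("age_16_to_99", "15"))           # False
print(is_valid("age_16_to_99", "42"))           # True
print(is_valid("letters_and_digits", "abc_1"))  # False

# Named group plus a backreference: the same word twice in a row.
repeated = re.compile(r"(?P<word>\w+) (?P=word)")
m = repeated.search("it is is fine")
print(m.group("word"))  # is
```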
Rough steps for using RE
- Use compile to turn the string that represents the regular expression into a pattern object
- The pattern object provides a series of methods for searching the text; a successful search returns the result as a Match object
- Finally, use the properties and methods of the Match object to get the information you need and act on it
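The three steps above can be sketched as follows. The sample text and the variable names `pattern` and `match` are chosen for illustration only.

```python
import re

# Step 1: compile the pattern string into a pattern object.
pattern = re.compile(r"\d+")

# Step 2: call a pattern method to search the text. The result is a
# Match object, or None when nothing matches.
match = pattern.search("order 66 shipped")

# Step 3: read what you need from the Match object.
if match is not None:
    print(match.group())  # 66
    print(match.span())   # (6, 8)
```

Checking for None before using the result avoids an AttributeError when the text contains no match.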
Common Match object methods
- group(): returns one or more matched strings; group() or group(0) returns the whole matched string
- start(): returns the start position, within the whole string, of the substring matched by the given group; the argument defaults to 0
- end(): returns the end position, within the whole string, of the substring matched by the given group; the argument defaults to 0
- span(): returns the tuple (start(group), end(group))
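As a small sketch of how these four methods relate to each other, the following uses a made-up input string with two numbered groups. The pattern and text are assumptions chosen so that the positions are easy to count.

```python
import re

m = re.search(r"(\d+)-(\d+)", "range 10-25 ok")

print(m.group())             # 10-25
print(m.group(1, 2))         # ('10', '25')
print(m.start(2), m.end(2))  # 9 11
print(m.span(2))             # (9, 11)
print(m.span())              # (6, 11)
```

Passing several group numbers to group() returns a tuple, and span(n) is simply start(n) and end(n) packed together.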
# Import the required package
import re
# Find a number
# The r prefix makes this a raw string, so backslashes are not treated as escapes
p = re.compile(r'\d+')
# Look in the string "one12twothree33456four78" using the rule compiled into p
# match returns a Match object on success, or None if nothing matches
# match only tries at the very start of the string, which is "o" here, so the result is None
m = p.match("one12twothree33456four78")
print(m)

Output:
None
# Import the required package
import re
# Find a number
# The r prefix makes this a raw string, so backslashes are not treated as escapes
p = re.compile(r'\d+')
# Look in the string "one12twothree33456four78" using the rule compiled into p
# match returns a Match object on success, or None if nothing matches
# The arguments 3 and 26 give the start and end positions of the range to search
m = p.match("one12twothree33456four78", 3, 26)
print(m)
# Notes on the code above
# 1. match accepts arguments that set the position where matching starts
# 2. Only one result is returned: the first successful match

Output (Python 3.7 and later show re.Match in place of _sre.SRE_Match):
<_sre.SRE_Match object; span=(3, 5), match='12'>
print(m[0])
print(m.start(0))
print(m.end(0))

Output:
12
3
5
import re
# re.I means that case is ignored
p = re.compile(r'([a-z]+) ([a-z]+)', re.I)
m = p.match("I am really love you")
print(m)

Output:
<_sre.SRE_Match object; span=(0, 4), match='I am'>
# Group 0 is the whole match
print(m.group(0))
print(m.start(0))
print(m.end(0))

Output:
I am
0
4

# Group 1 is the first pair of parentheses
print(m.group(1))
print(m.start(1))
print(m.end(1))

Output:
I
0
1

# Group 2 is the second pair of parentheses
print(m.group(2))
print(m.start(2))
print(m.end(2))

Output:
am
2
4

# groups returns all numbered groups as a tuple
print(m.groups())

Output:
('I', 'am')
Searching
- search(str[, pos[, endpos]]): looks for a match anywhere in the string; pos and endpos give the start and end positions of the range to search
- findall: finds all matches and returns them as a list
- finditer: finds all matches and returns an iterator of Match objects
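finditer and the pos and endpos arguments can be sketched as below. The sample string is the same kind of mixed text used elsewhere in this article, and the variable name `found` is an illustrative choice.

```python
import re

p = re.compile(r"\d+")
text = "one12two34three567four"

# finditer yields Match objects one at a time, so the position of
# every match is available, not just the matched text.
found = [(m.group(), m.span()) for m in p.finditer(text)]
print(found)
# [('12', (3, 5)), ('34', (8, 10)), ('567', (15, 18))]

# pos and endpos limit the search to the range from index 5 up to 10.
m = p.search(text, 5, 10)
print(m.group(), m.span())  # 34 (8, 10)
```

finditer is a reasonable choice over findall when the text is large or when the match positions matter.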
import re
p = re.compile(r'\d+')
# search scans the whole string and returns the first match
m = p.search("one12two34three567four")
print(m.group())

Output:
12
# findall returns every match as a list of strings
rst = p.findall("one12two34three567four")
print(type(rst))
print(rst)

Output:
<class 'list'>
['12', '34', '567']
Replacing with sub
- sub(repl, str[, count]): replaces each match in str with repl; count limits how many matches are replaced
# sub replacement example
import re
# \w matches letters, digits, and the underscore
p = re.compile(r'(\w+) (\w+)')
s = "hello 123 wang 456, i love you"
# Every pair of words separated by one space is replaced
rst = p.sub(r'Hello world', s)
print(rst)

Output:
Hello world Hello world, Hello world you
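As a further sketch of sub, the replacement string can refer back to captured groups, and count can limit the number of replacements. The variable names `swapped` and `first_only` are illustrative; the pattern and input are the ones from the example above.

```python
import re

p = re.compile(r"(\w+) (\w+)")
s = "hello 123 wang 456, i love you"

# \1 and \2 in the replacement refer to the captured groups, so this
# swaps the two words of every matched pair.
swapped = p.sub(r"\2 \1", s)
print(swapped)
# 123 hello 456 wang, love i you

# count limits how many matches are replaced (here only the first).
first_only = p.sub("Hello world", s, count=1)
print(first_only)
# Hello world wang 456, i love you
```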
Matching Chinese
- Most Chinese characters fall in the range [\u4e00-\u9fa5], which does not include full-width punctuation
import re
# The string must contain actual Chinese characters for the pattern to match
title = '世界 你好, hello moto'
p = re.compile(r'[\u4e00-\u9fa5]+')
rst = p.findall(title)
print(rst)

Output:
['世界', '你好']
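The point about full-width punctuation can be checked with a short sketch. The sample text here is an assumption made for illustration; it mixes Chinese characters, a full-width comma, a full-width exclamation mark, and ASCII letters.

```python
import re

# Illustrative text: "，" and "！" are full-width punctuation marks
# that lie outside the \u4e00-\u9fa5 range.
text = "你好，世界！Hello"
p = re.compile(r"[\u4e00-\u9fa5]+")
rst = p.findall(text)
print(rst)  # ['你好', '世界']
```

Because the punctuation is outside the range, it splits the Chinese text into separate matches instead of being included in them.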
Greedy and non-greedy matching
- Greedy: match as much as possible; quantifiers such as * are greedy
- Non-greedy: match as little as possible while still satisfying the pattern; adding ? after a quantifier, as in *?, makes it non-greedy
- RE uses greedy matching by default
import re
title = '<div>name</div><div>age</div>'
# Greedy: .* extends to the last </div>
p1 = re.compile(r'<div>.*</div>')
# Non-greedy: .*? stops at the first </div>
p2 = re.compile(r'<div>.*?</div>')
m1 = p1.search(title)
print(m1.group())
m2 = p2.search(title)
print(m2.group())

Output:
<div>name</div><div>age</div>
<div>name</div>
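The difference also shows up when collecting every match. The sketch below reuses the same HTML-like string with findall and a capturing group. The variable names are illustrative, and the last two lines show that the ? suffix applies to other quantifiers as well.

```python
import re

html = "<div>name</div><div>age</div>"

# Greedy: .* runs to the last </div>, so there is a single match.
greedy = re.findall(r"<div>(.*)</div>", html)
print(greedy)  # ['name</div><div>age']

# Non-greedy: .*? stops at the first </div> each time.
lazy = re.findall(r"<div>(.*?)</div>", html)
print(lazy)  # ['name', 'age']

# The same ? suffix works on other quantifiers such as + and {m,n}.
print(re.search(r"\d+?", "12345").group())      # 1
print(re.search(r"\d{2,4}?", "12345").group())  # 12
```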