Preface

If you clicked on this article, I know the title is a bit of clickbait, ha ha, but never mind: what I have written is all practical content.

Regular expressions are a powerful tool for working with strings. They have their own special syntax and can be used to search, replace, and validate strings.

An introductory example

OSChina (Open Source China) provides a regular expression testing tool at https://tool.oschina.net/regex/. Enter the text to match, then select one of the commonly used regular expressions, and you get the corresponding matching results.

This is regular expression matching in action: extracting particular text according to certain rules.

For example, an email address can be matched with

[\w!#$%&'*+/=?^_`{|}~-]+(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*@(?:[\w](?:[\w-]*[\w])?\.)+[\w](?:[\w-]*[\w])?

If the string of characters above looks like a mess, don’t worry. Here are the common matching rules.

Pattern Description
\w Matches letters, digits, and underscores
\W Matches any character other than letters, digits, and underscores
\s Matches any whitespace character, equivalent to [ \t\n\r\f\v] for ASCII text
\S Matches any non-whitespace character
\d Matches any digit, equivalent to [0-9] for ASCII text
\D Matches any non-digit character
\A Matches the beginning of the string
\Z Matches the end of the string. In many regex flavors it also matches just before a trailing newline; in Python’s re it matches only at the very end
\z Matches only at the very end of the string, after any trailing newline (Python’s re has historically used \Z for this instead)
\G Matches the position where the previous match ended (not supported by Python’s re)
\n Matches a newline character
\t Matches a tab character
^ Matches the beginning of the string (and of each line when re.M is specified)
$ Matches the end of the string (and of each line when re.M is specified)
. Matches any character except newline, or any character including newline when the re.DOTALL flag is specified
[…] Represents a set of characters listed individually; for example, [amk] matches a, m, or k
[^…] Matches any character not in the brackets; for example, [^abc] matches any character other than a, b, and c
* Matches zero or more repetitions of the preceding expression
+ Matches one or more repetitions of the preceding expression
? Matches zero or one occurrence of the preceding expression; placed after another quantifier, it makes that quantifier non-greedy
{n} Matches exactly n repetitions of the preceding expression
{n,m} Matches n to m repetitions of the preceding expression (greedy)
a|b Matches a or b
( ) Matches the expression inside the parentheses and also defines a group

Feeling a little dizzy after reading all that?

Don’t worry, the rest of this article explains how to use these rules in more detail.

Regular expressions are not unique to Python; they can be used in other programming languages. The re library in Python provides an implementation of regular expressions that can be used in Python.

match( )

The first common matching method is match(). Pass it a regular expression and a string, and it checks whether the regular expression matches the string.

The match() method tries to match the regular expression from the start of the string. It returns a match object if the match succeeds, or None if it doesn’t.

An example:

import re


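content = 'Hello 123 456 World_This is a Regex Demo'
print(len(content))
# A raw string (r'...') keeps backslashes such as \s and \d intact for the regex engine
result = re.match(r'^Hello\s\d\d\d\s\d{3}\s\w{10}', content)
print(result)          # the match object, or None if nothing matched
print(type(result))
print(result.group())  # the matched text
print(result.span())   # the (start, end) indices of the match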

This declares a string containing English letters, whitespace, digits, and so on.

Let’s take a quick look at the regular expression we just wrote.

The leading ^ anchors the match to the beginning of the string, so the match must start with Hello. Then \s matches one whitespace character and each \d matches one digit. \d{3} repeats the preceding \d three times. \w matches a letter, digit, or underscore, and {10} repeats it 10 times.

If you run the code above, you will see that the pattern does not cover the whole string, yet the match still succeeds; the result is simply shorter than the string.

In the match() method, the first argument is the regular expression and the second argument is the string to be matched.

Printing the result shows a re.Match object (displayed as SRE_Match in older Python versions), which proves the match succeeded. The object has two commonly used methods: group() returns the matched text, and span() returns the range of the match as a (start, end) pair of indices.
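Because match() returns None when the beginning of the string does not fit the pattern, it is worth checking the result before calling group(). The following is a minimal sketch of that check; the variable names hit and miss are only illustrative.

```python
import re

content = 'Hello 123 456 World_This is a Regex Demo'

# The pattern fits at the start of the string, so match() returns a match object.
hit = re.match(r'Hello\s\d{3}', content)

# 'World' occurs later in the string but not at position 0, so match() returns None.
miss = re.match(r'World', content)

for label, result in (('hit', hit), ('miss', miss)):
    if result is None:
        print(label, '-> no match')
    else:
        print(label, '->', result.group(), result.span())
```

Guarding with `is None` avoids the AttributeError that calling group() on a failed match would raise.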

Match the target

The match() method above returns the text that the whole pattern matched. If you want to extract only part of the string, enclose the substring you want in parentheses (). The parentheses mark the start and end of a subexpression, and each marked subexpression corresponds, in order, to a group. Call group() with the group’s index to retrieve the extracted result.

An example:

import re


content = 'Hello 123456 World_This is a Regex Demo'
print(len(content))
result = re.match(r'^Hello\s(\d+)\sWorld', content)
print(result)
print(type(result))
print(result.group())
print(result.group(1))
print(result.span())

Write and run the sample code above and you will see that 123456 has been extracted successfully. group(1) is used here, unlike group(), which returns the full match. If the pattern contained more parenthesized groups, group(2), group(3), and so on would return their contents in order.
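To make the group numbering concrete, here is a small sketch with two groups. The second group (\w+) is added purely for illustration and is not part of the example above.

```python
import re

content = 'Hello 123456 World_This is a Regex Demo'

# Two pairs of parentheses define group 1 (the digits) and group 2 (the next word).
result = re.match(r'^Hello\s(\d+)\s(\w+)', content)

print(result.group())   # the full match
print(result.group(1))  # first group: the digits
print(result.group(2))  # second group: the word after the digits
print(result.groups())  # all groups at once, as a tuple
```

groups() is convenient when you want every captured value without indexing them one by one.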

General matching

The regular expression we just wrote is actually quite cumbersome: whitespace is matched with \s and every digit with \d, which is a lot of work. A simpler option is the universal match .*, where the . (dot) matches any character except newline and the * (asterisk) repeats the preceding expression any number of times, so together they match any run of characters.

An example:

import re


content = 'Hello 123 456 World_This is a Regex Demo'
print(len(content))
result = re.match('^Hello.*Demo$', content)
print(result)
print(result.group())
print(result.span())

Greed and non-greed

When using the universal match .* above, the result is sometimes not what we want. Look at the following example:

import re


content = 'Hello 123456 World_This is a Regex Demo'
print(len(content))
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group(1))
print(result.span())

Run the code above and you will see that group(1) is 6, which is not what we want.

This is the difference between greedy and non-greedy matching. In greedy mode, .* matches as many characters as possible. In the regular expression, .* is followed by \d+, which requires at least one digit.

Therefore .* takes as much as it can, including 12345, and leaves only the single digit 6 to satisfy \d+.

The fix is a non-greedy match, written .*? with one extra ?. Let’s see what happens.

import re


content = 'Hello 123456 World_This is a Regex Demo'
print(len(content))
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))
print(result.span())

Run the above code and you will clearly see that you have obtained 123456.

A non-greedy match consumes as few characters as possible. As soon as .*? reaches the first digit it stops, and \d+ takes over from there and matches all the digits.

Note, however, that if the part you want is at the end of the string, a trailing .*? may match nothing at all, because non-greedy matching consumes as little as possible.
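The end-of-string caveat can be seen in a short sketch. The URL below is made up for illustration and does not come from the examples above.

```python
import re

# Hypothetical URL, used only to demonstrate the behaviour.
content = 'https://example.com/comment/kEraCN'

# Non-greedy group at the end of the pattern: it is allowed to match nothing, so it does.
lazy = re.match(r'http.*?comment/(.*?)', content)

# Greedy group at the end of the pattern: it consumes the rest of the string.
greedy = re.match(r'http.*?comment/(.*)', content)

print(repr(lazy.group(1)))
print(repr(greedy.group(1)))
```

A reasonable rule of thumb is to use .*? in the middle of a pattern and .* at the end when you want everything that remains.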

The modifier

A regular expression can take optional flag modifiers that control how matching works. They are passed as an optional argument.

Consider the following example:

import re


content = '''Hello 123456 World_This
is a Regex Demo'''
print(len(content))
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))
print(result.span())

The result:

None
Traceback (most recent call last):
  File "D:/github/Python_scrapy/ code/demo5.py", line 9, in <module>
    print(result.group(1))
AttributeError: 'NoneType' object has no attribute 'group'

The return value is None, so calling group() raises an AttributeError. The reason is that . (dot) matches any character except newline. The string in the program above contains a newline in the middle, so the match fails.

You can fix this by simply adding the modifier re.S.

result = re.match(r'^He.*?(\d+).*Demo$', content, re.S)

re.S is often used when matching web pages, because HTML nodes are frequently separated by newlines.

Here are some common modifiers:

Modifier Description
re.I Makes the match case-insensitive
re.L Makes the match locale-aware
re.M Multi-line matching; affects ^ and $
re.U Interprets characters according to the Unicode character set; affects \w, \W, \b, and \B
re.X Allows a more flexible layout so that the regular expression is easier to read
re.S Makes . match all characters, including newlines
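A few of these modifiers can be tried out as follows. This is only a sketch: the patterns are illustrative, and the two-line string assumes the newline sits after World_This.

```python
import re

content = '''Hello 123456 World_This
is a Regex Demo'''

# re.I: a lowercase pattern still matches the capitalised text.
ci = re.match(r'hello', content, re.I)

# Flags can be combined with |: here . also matches the newline (re.S)
# and letter case is ignored (re.I).
combined = re.match(r'^he.*?(\d+).*demo$', content, re.S | re.I)

# re.X: whitespace and # comments inside the pattern are ignored,
# which lets a long pattern be spread over several lines.
verbose = re.match(r'''
    ^Hello \s   # greeting followed by one whitespace character
    (\d+)       # capture the run of digits
''', content, re.X)

print(ci.group())
print(combined.group(1))
print(verbose.group(1))
```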

Escape to match

We know that regular expressions give special meaning to many characters, such as . matching any character other than newline. But what if the target string itself contains a . (dot)?

This is where escaping comes in: put a backslash in front of the special character so that it is matched literally.

Code example:

import re

content = '(baidu) www.baidu.com'
# \( \) and \. match literal parentheses and dots
result = re.match(r'\(baidu\) www\.baidu\.com', content)
print(result.group())

Run the code above and you will see that the original string is matched successfully.
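Escaping by hand gets tedious for longer literals. The standard library’s re.escape() adds the backslashes for you; the sketch below also shows what goes wrong when a dot is left unescaped. The test string wwwXbaiduYcom is invented for the demonstration.

```python
import re

content = '(baidu) www.baidu.com'

# re.escape() backslash-escapes every character that is special in a regex.
pattern = re.escape(content)
result = re.match(pattern, content)
print(result.group())

# An unescaped dot matches any character, so this pattern is too permissive.
loose = re.match(r'www.baidu.com', 'wwwXbaiduYcom')
# With escaped dots, only a literal '.' is accepted.
strict = re.match(r'www\.baidu\.com', 'wwwXbaiduYcom')
print(loose is not None, strict is not None)
```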

search( )

As mentioned earlier, the match() method matches from the beginning of the string. If the beginning of the string does not match, the whole match fails.

Because match() always has to start at the beginning, it is not always convenient.

There is another method, search(), which scans the entire string and returns the first match it finds.

search() is called the same way as match().
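The difference between the two methods shows up as soon as the text of interest is not at the start of the string. A minimal sketch, with 'Extra text ' prepended as an assumption for the demonstration:

```python
import re

# The interesting part no longer sits at the beginning of the string.
content = 'Extra text Hello 123456 World_This is a Regex Demo'

# match() only tries position 0, so it fails here.
m = re.match(r'Hello\s(\d+)', content)

# search() scans forward and returns the first place where the pattern fits.
s = re.search(r'Hello\s(\d+)', content)

print(m)
print(s.group(1), s.span())
```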

findall( )

search(), described above, returns only the first string that matches the rule. To get all of them, you need the findall() method. It searches the entire string and returns every non-overlapping match as a list. It is called the same way as search() and match().
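A short sketch of findall() next to search(), using a small sample string. Note that when the pattern contains groups, findall() returns the groups rather than the full matches.

```python
import re

content = 'sdsd55wee66err33'

# search() stops at the first match.
first = re.search(r'\d+', content).group()

# findall() returns every non-overlapping match as a list of strings.
numbers = re.findall(r'\d+', content)

# With more than one group in the pattern, each list item is a tuple of the groups.
pairs = re.findall(r'([a-z]+)(\d+)', content)

print(first)
print(numbers)
print(pairs)
```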

sub( )

Besides matching strings, regular expressions can also be used to modify text. For example, if you want to remove all digits from a string, doing it with the string’s replace() method would be tedious. Instead, you can use the sub() method, as shown below:

import re

content = 'sdsd55wee66err33'
# Arguments: the pattern, the replacement (an empty string removes the match), the input string
result = re.sub(r'\d+', '', content)
print(result)

Run the code above and you will see that all numbers in the string have been removed.

compile( )

All of the previous methods process strings directly. The compile() method instead compiles a regular expression string into a regular expression object so that it can be reused in later matches. The code is as follows:

import re

content1 = 'the 2020-12-29 02:35'
content2 = 'the 2020-12-30 03:35'
content3 = 'the 2020-12-31 01:35'

# Compile once, then reuse the same pattern object in every call
pattern = re.compile(r'\d{2}:\d{2}')
result1 = re.sub(pattern, '', content1)
result2 = re.sub(pattern, '', content2)
result3 = re.sub(pattern, '', content3)
print(result1, result2, result3)

Run the code above and you’ll see that the times are removed from all three strings, while the matching rule is written only once.

compile() also accepts modifiers such as re.S and re.I, so that they do not need to be passed again to match(), search(), or findall().
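As a sketch of that last point, the flags can be given once to compile() and the resulting object’s own methods used afterwards. The two-line string is an assumed sample with a newline after World_This.

```python
import re

content = '''Hello 123456 World_This
is a Regex Demo'''

# The flags are stored in the compiled object, so later calls do not repeat them.
pattern = re.compile(r'^he.*?(\d+).*demo$', re.S | re.I)
result = pattern.match(content)
print(result.group(1))

# A compiled pattern offers the same methods as the re module functions.
time_pattern = re.compile(r'\d{2}:\d{2}')
times = time_pattern.findall('the 2020-12-29 02:35')
cleaned = time_pattern.sub('', 'the 2020-12-29 02:35')
print(times, repr(cleaned))
```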

Finally

That wraps up this introduction to regular expressions. Have you got the hang of it? Let me know in the comments section.

If you’ve read this far, I hope it was helpful; that is why I wrote it.

The way ahead is so long without ending, yet high and low I’ll search with my will unbending.

I am book-learning, someone who focuses on learning. The more you know, the more you realize you don’t know. See you next time for more exciting content!