This article focuses on regular expressions and Python’s RE library.

Regular expressions

Regular expression is a powerful tool for processing strings. It has its own specific syntax structure, which can efficiently achieve the operation of string retrieval, replacement, match verification and so on.

We can use regular expressions to extract information of interest from the HTML, just as we did when we used the Requests library to crawl problems in zhihu discovery.

The following table shows the common matching rules of regular expressions:

model describe
\w Matches alphanumeric and underscore
\W Matches non-alphanumeric digits and underscores
\s Matches any whitespace character, equivalent to[\t\n\r\f]
\S Matches any non-null character
\d Matching any number is the same thing as[0-9]
\D Matches any non-number
\A Match string start
\Z End of the match string. If there is a newline, only the end string before the newline is matched
\z Match string End
\G Matches the position where the last match was completed
\n Matches a newline character
\t Matches a TAB character
^ Matches the beginning of the string
$ Matches the end of the string
. Matches any character except the newline, and when the re.dotall tag is specified, matches any character including the newline
[...]. Used to represent a group of characters, listed separately:[amk]Match ‘a’, ‘m’ or ‘k’
[^...]. Characters not in [] :[^abc]Matches characters other than a,b, and c.
* Matches zero or more expressions.
+ Matches one or more expressions.
? Matches zero or one fragment defined by the previous regular expression, non-greedy
{n} Matches exactly n of the preceding expressions.
{n, m} Matches the fragment defined by the previous regular expression n to m times, in greedy fashion
`a b`
( ) Matches expressions in parentheses, also representing a group

⚠️ [Note] With greedy matches,.* matches as many characters as possible; Non-greedy matches match as few characters as possible. ⚠️ [Note] Escape: If we want to represent some of the symbols in the table in the re, we need to escape, using the \+ symbol. For example, to represent parentheses (, we need to write \(

For example, we can crawl the problem in “Zhihu-discover” by using the following regular expression:

explore-feed.*? question_link.*? > (. *?) </Copy the code

It matches the requested HTML:

<div class="explore-feed feed-item" data-offset="1">
<h2><a class="question_link" href="/question/311635229/answer/..." target="_blank" data-id="..." data-za-element-name="Title">.</a></h2>
Copy the code

Note: TAOUP has a diagram that also nicely summarizes regular expressions, and I post it here for your reference:

RE library

Python’s built-in RE library provides support for regular expressions.

Here’s how to use the Re library:

match()

The match() method, which requires us to pass in the string to match and the regular expression, checks to see if the string matches the regular expression.

The match() method determines whether there is a match and returns a match object if a match is successful, or None otherwise.

Match () will match from the start, that is, if the first character doesn’t match, it doesn’t match.

We often use it like this:

test = 'String to test'
if re.match(R 'regular expression', test):
    print('Match')
else:
    print('Failed')
Copy the code

Extract groups:

>>> text = 'abc 123'

>>> print(re.match(r'\s+\w\d', text))
None

>>> r = re.match(r'\w*? (\d{3})', text)
>>> r
<re.Match object; span=(0.7), match='abc 123'>
>>> r.group()
'abc 123'
>>> r.group(0)
'abc 123'
>>> r.group(1)
'123'
Copy the code
  • Group () and group(0) output a complete match.
  • Group (1), and not in this case group(2), group(3)… It outputs one, two, three… A be(a)Enclosing the matching result.

The modifier

The modifier describe
re.I Make the match case-insensitive
re.L Do locale-aware matching
re.M Multi-line matching, impact^$
re.S make.Matches all characters including newlines
re.U Parse characters according to the Unicode character set. This symbol affects\w.\W.\b.\B.
re.X This flag allows you to write regular expressions that are easier to understand by giving you more flexibility in the format.

These modifiers can be passed as the third argument to re.match, producing the effect described above.

result = re.match('^He.*? (\d+).*? Demo$', content, re.S)
Copy the code

search()

Unlike match(), search() scans the entire string for a match and returns the first successful match, or None if there is None.

>>> p = '\d+'
>>> c = 'asd123sss'
>>> r = re.search(p, c)
>>> r
<re.Match object; span=(3.6), match='123'>
>>> r.group()
'123'
Copy the code

findall()

Findall () searches the entire string and returns everything that matches the regular expression, returning a list as a result.

>>> import re
>>> p = '\d+'
>>> c = 'asd123dfg456;;; 789 '
>>> re.findall(p, c)
['123'.'456'.'789']
Copy the code