Why learn regular expressions when you already know XPath and CSS selectors?

Regular expressions treat HTML as plain text and extract whatever matches a specified pattern. They are well suited to small fragments of text, to strings with a fixed format (such as a phone number or email address), and to HTML that embeds JavaScript code, where CSS selectors or XPath cannot be used.

Online regular expression tester: tool.oschina.net/regex…

Python re module documentation: docs.python.org/zh-cn/3/library/re.html

Understanding regular expressions

A regular expression is a logical formula for operating on strings. It combines predefined characters into a "pattern string" that expresses the filtering logic to apply to other strings.

Common concepts of regular expressions

  • Boundary matching: ^ matches at the start of the string and consumes no characters; $ matches at the end of the string and consumes no characters.

    str = "cat abdcatdetf ios"

    ^cat: checks that the line starts with c, followed by a, then t

    ios$: checks that the line ends with s, with o as the second-to-last character and i as the third-to-last

    ^cat$: starts with c, followed by a, then t, then the end of the line: a line containing only "cat"

    ^$: the start is immediately followed by the end: an empty line with no characters

    ^: the start of a line; it matches any line, because every line has a start
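The anchor rules above can be checked with a short sketch. It uses the sample string from this section; the two-line string `text` and the variable names are assumptions made for illustration, and `re.M` is the multi-line flag from Python's re module.

```python
import re

s = "cat abdcatdetf ios"

# ^cat matches only at the very start of the string.
start = re.search(r"^cat", s)
print(start.span())   # (0, 3)

# ios$ matches only at the very end of the string.
end = re.search(r"ios$", s)
print(end.span())     # (15, 18)

# ^cat$ requires the whole string to be exactly "cat", so it fails here.
whole = re.search(r"^cat$", s)
print(whole)          # None

# With re.M, ^ and $ also match at the start and end of every line.
text = "dog\ncat\n"   # hypothetical multi-line input
per_line = re.findall(r"^cat$", text, re.M)
print(per_line)       # ['cat']
```

Note that the "cat" in the middle of the sample string is never reported by `^cat`, because anchors constrain the position of a match rather than consuming characters.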

    \b: matches a word boundary, that is, the position between a word character and a non-word character (such as a space), and consumes no characters;

    "er\b" matches the "er" in "never", but not the "er" in "verb".

    \B: the negation of \b, that is, it matches at a non-word boundary;

    "er\B" matches the "er" in "verb", but not the "er" in "never".
  • Greedy vs. non-greedy quantifiers. Regular expressions are often used to find matching strings in text. Quantifiers in Python are greedy by default (a few languages may default to non-greedy) and always try to match as many characters as possible; non-greedy quantifiers, by contrast, try to match as few characters as possible. For example:

    The regular expression "ab*", searched against "abbbc", finds "abbb". The non-greedy form "ab*?" finds "a".
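The word-boundary and quantifier rules above can be verified with a minimal sketch. The strings "never verb" and "abbbc" come from the text; the HTML snippet and all variable names are assumptions added for illustration.

```python
import re

words = "never verb"

# \b: "er" followed by a word boundary -> only the "er" that ends "never".
end_of_word = re.search(r"er\b", words)
print(end_of_word.span())   # (3, 5)

# \B: "er" NOT followed by a word boundary -> only the "er" inside "verb".
inside_word = re.search(r"er\B", words)
print(inside_word.span())   # (7, 9)

# Greedy: b* takes as many b's as it can.
greedy = re.search(r"ab*", "abbbc").group()
print(greedy)               # abbb

# Non-greedy: b*? takes as few b's as it can (here, none).
lazy = re.search(r"ab*?", "abbbc").group()
print(lazy)                 # a

# The difference matters when pulling fields out of markup.
html = "<b>one</b><b>two</b>"   # hypothetical snippet
greedy_tags = re.findall(r"<b>(.*)</b>", html)
lazy_tags = re.findall(r"<b>(.*?)</b>", html)
print(greedy_tags)          # ['one</b><b>two']
print(lazy_tags)            # ['one', 'two']
```

In scraping code the non-greedy form is usually the one you want between an opening and a closing tag, since the greedy form runs to the last closing tag on the line.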

  • Backslash problem

As in most programming languages, regular expressions use "\" as the escape character, which can cause backslash trouble.

If you need to match a literal "\" in the text, the regular expression written in an ordinary string literal needs four backslashes, "\\\\": the programming language first turns each pair into a single backslash, and the regular expression engine then reads the resulting two backslashes as one escaped, literal backslash.

Raw strings in Python solve this problem nicely. In this example, the regular expression can be written as r"\\".

Similarly, "\\d", which matches a digit, can be written as r"\d". With raw strings you no longer have to worry about missing a backslash, and the expressions you write are easier to read.

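import re

# Raw strings keep every backslash intact for the regex engine.
# The target is raw too, so that \c is a literal backslash followed by c.
a = re.search(r"\\", r"ab123bb\c")
print(a.group())
# prints: \

a = re.search(r"\d", r"ab123bb\c")
print(a.group())
# prints: 1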
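To make the four-backslash rule concrete, the sketch below compares an ordinary string literal with a raw string. The variable names are illustrative assumptions; the target text is written with an escaped backslash so that it contains exactly one real backslash.

```python
import re

# The text ab123bb\c, containing one real backslash.
text = "ab123bb\\c"

# Ordinary literal: four typed backslashes become two characters,
# which the regex engine reads as one escaped (literal) backslash.
normal = "\\\\"

# Raw string: the two typed backslashes are kept exactly as written.
raw = r"\\"

print(normal == raw)   # True
print(len(raw))        # 2

# Both patterns find the backslash at the same position.
pos_normal = re.search(normal, text).span()
pos_raw = re.search(raw, text).span()
print(pos_normal)      # (7, 8)
print(pos_raw)         # (7, 8)
```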

Python re module

Python comes with the re module, which provides support for regular expressions.

The match function

re.match attempts to match a pattern at the start of the string; if the match does not succeed at the start position, match() returns None.

Here is the syntax for this function:

re.match(pattern, string, flags=0)

Here is a description of the parameters:

  • pattern: the regular expression to match.
  • string: the string to search; the pattern is matched against its beginning.
  • flags: flags that control how the regular expression matches, such as case sensitivity and multi-line matching.

The re.match method returns a match object on success, or None otherwise.

We can use the match object methods group(num) or groups() to retrieve the matched text.

  • group(num=0): returns the entire match (or the subgroup numbered num).
  • groups(): returns all matched subgroups as a tuple (empty if there are none).

Example:

import re

line = "Cats are smarter than dogs"
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")

When the above code is executed, it produces the following result:

matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter
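The example above only calls group(); a small sketch of groups() and related accessors on the same match may help. The variable names are illustrative, and span() is an additional match object method from the re module that the text above does not cover.

```python
import re

line = "Cats are smarter than dogs"
m = re.match(r"(.*) are (.*?) .*", line, re.M | re.I)

# groups() returns every captured subgroup as a tuple.
all_groups = m.groups()
print(all_groups)      # ('Cats', 'smarter')

# group() accepts several group numbers at once and returns a tuple.
pair = m.group(1, 2)
print(pair)            # ('Cats', 'smarter')

# span(n) gives the start and end index of group n in the string.
second_span = m.span(2)
print(second_span)     # (9, 16)

# A pattern without capture groups yields an empty tuple.
no_groups = re.match(r"Cats", line).groups()
print(no_groups)       # ()
```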

Regular expression modifiers: optional flags

Regular expressions can take optional flags (modifiers) that control various aspects of matching. Multiple flags can be combined with bitwise OR (|), as in the previous example (re.M | re.I). The flags are:

  • re.I (re.IGNORECASE): makes matching case-insensitive
  • re.M (re.MULTILINE): multi-line matching; affects ^ and $
  • re.S (re.DOTALL): makes . match any character, including a newline
  • re.X (re.VERBOSE): lets the regular expression span multiple lines, ignores whitespace, and allows comments

The findall function

re.findall(pattern, string, flags=0)

Returns all non-overlapping matches of the pattern in the string, as a list of strings. The string is scanned from left to right, and matches are returned in the order found.

import re

pattern = r"\w+"
target = "hello world\nWORLD HELLO"

# Default (no flags)
print(re.findall(pattern, target))
# ['hello', 'world', 'WORLD', 'HELLO']

# re.I: ignore case
print(re.findall("world", target, re.I))
# ['world', 'WORLD']

# re.S: . also matches the newline
print(re.findall("world.WORLD", target, re.S))
# ['world\nWORLD']
print(re.findall("hello.*WORLD", target, re.S))
# ['hello world\nWORLD']

# re.M: ^ matches at the start of every line
print(re.findall("^WORLD", target, re.M))
# ['WORLD']

# re.X: whitespace and comments inside the pattern are ignored
reStr = r'''\d{3}  # area code
-\d{8}'''  # number
print(re.findall(reStr, "010-12345678", re.X))
# ['010-12345678']
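As a sketch of the use case from the introduction (fixed-format strings buried in HTML or JavaScript), findall can pull email addresses and phone numbers out of a page fragment. The fragment, the deliberately simple patterns, and the variable names are all assumptions for illustration; real-world email matching needs a stricter pattern.

```python
import re

# Hypothetical page fragment mixing markup and inline JavaScript.
html = ('<p>Contact: alice@example.com, bob@example.org</p>'
        '<script>var tel = "010-12345678";</script>')

# No capture groups: findall returns the full matches as strings.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", html)
print(emails)   # ['alice@example.com', 'bob@example.org']

# With several capture groups, findall returns a list of tuples instead,
# one tuple per match, holding the groups.
phones = re.findall(r"(\d{3})-(\d{8})", html)
print(phones)   # [('010', '12345678')]
```

Whether findall returns strings or tuples depends on the number of capture groups in the pattern, which is worth checking before indexing into the result.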

The search function

re.search scans the entire string and returns the first successful match.

Here is the syntax for this function:

re.search(pattern, string, flags=0)

Here are the parameters:

  • pattern: the regular expression to match.
  • string: the string to search; the pattern may match at any position in it.
  • flags: flags that control how the regular expression matches, such as case sensitivity and multi-line matching.

The re.search method returns a match object on success, or None otherwise.

We can use the match object methods group(num) or groups() to retrieve the matched text.

  • group(num=0): returns the entire match (or the subgroup numbered num).
  • groups(): returns all matched subgroups as a tuple (empty if there are none).

Example:

import re

line = "Cats are smarter than dogs"
searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)

if searchObj:
    print("searchObj.group() : ", searchObj.group())
    print("searchObj.group(1) : ", searchObj.group(1))
    print("searchObj.group(2) : ", searchObj.group(2))
else:
    print("Nothing found!!")

When the above code is executed, it produces the following result:

searchObj.group() :  Cats are smarter than dogs
searchObj.group(1) :  Cats
searchObj.group(2) :  smarter

Difference between re.match and re.search

re.match matches only at the beginning of the string. If the beginning of the string does not match the regular expression, the match fails and None is returned. re.search scans the whole string until it finds a match.

Example:

import re

line = "Cats are smarter than dogs"

matchObj = re.match(r'dogs', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

searchObj = re.search(r'dogs', line, re.M | re.I)
if searchObj:
    print("search --> searchObj.group() : ", searchObj.group())
else:
    print("Nothing found!!")

When the above code is executed, the following results are produced:

No match!!
search --> searchObj.group() :  dogs
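One detail the example does not show is how the two functions behave with re.M. The sketch below uses a hypothetical two-line string; as far as the re documentation describes it, re.match still only tries the very start of the string even in multi-line mode, while re.search combined with ^ can match at the start of any line.

```python
import re

text = "first line\ndogs on line two"   # hypothetical two-line input

# re.match looks only at position 0, even with re.M.
at_start = re.match(r"dogs", text, re.M)
print(at_start)            # None

# re.search with ^ and re.M finds "dogs" at the start of line two.
line_start = re.search(r"^dogs", text, re.M)
print(line_start.span())   # (11, 15)

# Without re.M, ^ only matches at the start of the whole string.
no_flag = re.search(r"^dogs", text)
print(no_flag)             # None
```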

Search and replace

Python’s re module provides re.sub to replace matches in strings.

Syntax

re.sub(pattern, repl, string, count=0, flags=0)

Returns the string obtained by replacing the leftmost non-overlapping matches of pattern in string with repl.

If the pattern is not found, the string is returned unchanged. The optional parameter count is the maximum number of matches to replace. count must be a non-negative integer; the default value 0 replaces all matches.

Example

Here is an example of a crawler moving to the next page of results:

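import re

url = "http://hr.tencent.com/position.php?&start=10"
# Extract the current offset from the start= query parameter.
page = re.search(r'start=(\d+)', url).group(1)

# Replace it with the offset of the next page (10 results per page).
nexturl = re.sub(r'start=(\d+)', 'start=' + str(int(page) + 10), url)

print("Next Url : ", nexturl)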

When the above code is executed, the following results are produced:

Next Url :  http://hr.tencent.com/position.php?&start=20
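Two further points about re.sub can be sketched briefly: the effect of count, and passing a function as repl so that the page offset is computed in a single call. The sample string "a1b22c333" and the helper name `next_page` are assumptions for illustration; the URL is the one from the example above.

```python
import re

# count=0 (the default) replaces every match; count=1 only the first.
replace_all = re.sub(r"\d+", "#", "a1b22c333")
replace_one = re.sub(r"\d+", "#", "a1b22c333", count=1)
print(replace_all)   # a#b#c#
print(replace_one)   # a#b22c333

# repl may also be a function that receives the match object and
# returns the replacement text.
def next_page(match):
    # hypothetical helper: advance the start offset by one page of 10
    return "start=" + str(int(match.group(1)) + 10)

url = "http://hr.tencent.com/position.php?&start=10"
nexturl = re.sub(r"start=(\d+)", next_page, url)
print(nexturl)       # http://hr.tencent.com/position.php?&start=20
```

The function form avoids the separate re.search call, at the cost of a slightly less obvious reading for newcomers.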

Regular expression syntax

The following table lists the regular expression syntax available in Python:

![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/d9827c20ebf84eeeb5c8da7e8de20b5b)