Regular expressions parse web pages

First turn the source code into a string, and then match the desired data with a regular expression

model describe model describe
. Matches any character except newline \s Matching whitespace characters
* Matches the preceding character 0 or more times \S Matches any non-whitespace character
+ Matches the preceding character 1 or more times \d Match a number [0 to 9]
? Matches the preceding character 0 or 1 times \D Matches any non-digit, [^0~9]
^ Matching the beginning of a string \w Match alphanumeric, [A-ZA-Z0-9]
$ Matching the end of a string \W Match non-alphanumeric, [^ A-za-z0-9]
( ) Matches the expression in parentheses, also representing a group [] Used to represent a group of characters

Re.match matches only from the start of the string

Syntax format

re.match(pattern,string,flags=0)
Copy the code

Pattern: indicates a regular expression

String: string to be matched

Flags: indicates the matching mode of the regular expression

import re
s="aaa bbb ccc ddd eee"
m=re.match(r'(.*) ccc (.*? ) ',s)
Matches the result of the entire string
print(m.group(0))
# (.*) result
print(m.group(1))
# (. *?) The results of the
print(m.group(2))
# result list
print(m.groups())
Copy the code

Re.search scans the entire string to return the first successful match

s="aaa bbb ccc ddd eee"
m1=re.match('ccc',s)
m2=re.search('ccc',s)
print(m1)
print(m2)
Copy the code

Re.findall finds all matches and returns them as a list

import re
s="aaa12313bbb6788ccc56789ddd eee"
m=re.findall('[0-9] +',s)
print(m)
Copy the code

Parse web pages

Get the title of the blog

Analysis of the HTML

Writing regular expressions

<h4 data-v-6fe2b6a7>(.*?) </h4> Gets the title in the middle of the H4 tagCopy the code
Parsing the web page using regular expressions using Requests to obtain the web page sourceCopy the code
import re
import requests
headers={
    'User-Agent': 'the Mozilla / 5.0 (Windows NT 10.0; Win64; x64; The rv: 94.0) Gecko / 20100101 Firefox 94.0 / '
}
url='https://blog.csdn.net/weixin_42403632'
html=requests.get(url,headers=headers)
titles=re.findall('

(.*?)

'
,html.text) for i in titles: print(i) print(len(titles)) Copy the code