Parsing web pages with regular expressions
First convert the page source into a string, then match the desired data with a regular expression.
Pattern | Description | Pattern | Description |
---|---|---|---|
. | Matches any character except a newline | \s | Matches any whitespace character |
* | Matches the preceding element 0 or more times | \S | Matches any non-whitespace character |
+ | Matches the preceding element 1 or more times | \d | Matches a digit, [0-9] |
? | Matches the preceding element 0 or 1 times | \D | Matches any non-digit, [^0-9] |
^ | Matches the beginning of the string | \w | Matches a word character, [A-Za-z0-9_] for ASCII text |
$ | Matches the end of the string | \W | Matches a non-word character, [^A-Za-z0-9_] for ASCII text |
( ) | Matches the expression in parentheses and captures it as a group | [] | Defines a set of characters, any one of which may match |
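As a quick check of the table, the sketch below runs a few of these patterns against a made-up sample string. The string and the variable names are illustrative only and do not come from the original post.

```python
import re

# Hypothetical sample text, chosen only to exercise patterns from the table.
sample = "Order 66: 3 apples, 12 pears"

digits = re.findall(r"\d+", sample)   # runs of one or more digits
words = re.findall(r"\w+", sample)    # runs of word characters
starts_with_order = re.match(r"^Order", sample) is not None   # ^ anchors at the start
ends_with_pears = re.search(r"pears$", sample) is not None    # $ anchors at the end
vowel_count = len(re.findall(r"[aeiou]", sample))             # [] is a character set

print(digits)
print(words)
print(starts_with_order, ends_with_pears, vowel_count)
```

Raw strings (the r prefix) keep Python from interpreting backslashes before the regex engine sees them.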
re.match matches only at the start of the string
Syntax
re.match(pattern,string,flags=0)
pattern: the regular expression to match
string: the string to be searched
flags: optional flags that modify how the pattern matches, for example re.I for case-insensitive matching
import re
s="aaa bbb ccc ddd eee"
m=re.match(r'(.*) ccc (.*?) ',s)
# the entire matched text
print(m.group(0))
# result of (.*)
print(m.group(1))
# result of (.*?)
print(m.group(2))
# tuple of all group results
print(m.groups())
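The difference between the greedy (.*) and the non-greedy (.*?), and the effect of the flags parameter, can be sketched as follows. The sample string is extended with an extra word (an assumption made here so that the two quantifiers give different results), and the variable names are illustrative.

```python
import re

# Sample string with one more word than the example above, so that
# greedy and non-greedy matching produce different second groups.
s = "aaa bbb ccc ddd eee fff"

# Greedy: (.*) takes as much as possible, so the final space in the
# pattern lines up with the last space in the string.
greedy = re.match(r"(.*) ccc (.*) ", s)

# Non-greedy: (.*?) takes as little as possible, so it stops at the
# first space after "ddd".
lazy = re.match(r"(.*) ccc (.*?) ", s)

# flags: re.IGNORECASE lets an uppercase pattern match lowercase text.
ignore_case = re.match(r"AAA", s, flags=re.IGNORECASE)

print(greedy.group(2))
print(lazy.group(2))
print(ignore_case.group(0))
```

When extracting a field that is followed by more text of the same shape, the non-greedy form is usually the safer default.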
re.search scans the entire string and returns the first successful match
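import re
s="aaa bbb ccc ddd eee"
# match only tries position 0, so it finds nothing here
m1=re.match('ccc',s)
# search scans forward and finds the first occurrence
m2=re.search('ccc',s)
# prints None
print(m1)
# prints a match object
print(m2)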
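Because both functions return None when nothing matches, calling .group() directly on the result can raise an AttributeError. A minimal sketch of a guard is shown below; first_match is a hypothetical helper written for this illustration, not part of the re module.

```python
import re

s = "aaa bbb ccc ddd eee"


def first_match(pattern, text):
    """Return the text of the first match anywhere in `text`, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


at_start = re.match("ccc", s)    # None, because "ccc" is not at position 0
anywhere = re.search("ccc", s)   # a match object

print(at_start)
print(anywhere.span())           # start and end index of the match
print(first_match("zzz", s))     # no match, so the helper returns None
```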
re.findall finds all non-overlapping matches and returns them as a list
import re
s="aaa12313bbb6788ccc56789ddd eee"
# one or more consecutive digits
m=re.findall('[0-9]+',s)
print(m)
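What re.findall returns depends on how many capturing groups the pattern contains, which matters when a pattern wraps the wanted text in parentheses. The sketch below reuses the string above; the variable names are illustrative.

```python
import re

s = "aaa12313bbb6788ccc56789ddd eee"

# No group: each item is the whole match.
numbers = re.findall(r"[0-9]+", s)

# One group: each item is only the text captured by that group.
prefixes = re.findall(r"([a-z]+)[0-9]+", s)

# Two groups: each item is a tuple with one entry per group.
pairs = re.findall(r"([a-z]+)([0-9]+)", s)

print(numbers)
print(prefixes)
print(pairs)
```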
Parsing a web page
Goal: get the titles of the blog posts
Analyze the HTML
Write the regular expression
<h4 data-v-6fe2b6a7>(.*?)</h4>
The group (.*?) captures the title text between the opening and closing h4 tags.
Fetch the page source with Requests, then parse it with the regular expression
import re
import requests
# Send a browser-like User-Agent, since some sites may reject the default one
headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
}
url='https://blog.csdn.net/weixin_42403632'
html=requests.get(url,headers=headers)
# Capture the text between each pair of h4 tags
titles=re.findall('<h4 data-v-6fe2b6a7>(.*?)</h4>',html.text)
for i in titles:
    print(i)
print(len(titles))
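The same extraction can be tried without a network request. The sketch below applies the h4 pattern to a hard-coded fragment; the fragment, the name extract_titles, and the use of re.S are assumptions added for illustration, and the real page markup may differ (attributes such as data-v-6fe2b6a7 can change when a site is rebuilt).

```python
import re

# Hypothetical page fragment standing in for html.text; not real site markup.
html_text = """
<div class="list">
  <h4 data-v-6fe2b6a7>First post</h4>
  <h4 data-v-6fe2b6a7>
    Second post
  </h4>
</div>
"""

# re.S lets "." match newlines too, so titles that span lines are still captured.
TITLE_RE = re.compile(r"<h4 data-v-6fe2b6a7>(.*?)</h4>", re.S)


def extract_titles(text):
    """Return the stripped text of every matching h4 element."""
    return [title.strip() for title in TITLE_RE.findall(text)]


titles = extract_titles(html_text)
for title in titles:
    print(title)
print(len(titles))
```

Regular expressions work for a fixed, simple structure like this one; for nested or irregular markup an HTML parser is generally the more robust choice.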