Use Python to crawl and parse a web page
- 1. Crawler preparation
  - 1.1. Basic Python syntax
  - 1.2. The overall idea of crawling a web page
    - 1.2.1. Basic concepts
    - 1.2.2. The simple approach
    - 1.2.3. The detailed approach
  - 1.3. Third-party libraries to install
- 2. Code examples
  - 2.1. The data is in the target URL
  - 2.2. The data is returned via other URLs
- 3. Code analysis
  - 3.1. The data is in the target URL
  - 3.2. The data is returned via other URLs
  - 3.3. Summary
- 4. The underlying principle
- 5. Project address
1. Crawler preparation
1.1. Basic Python syntax
This article assumes you are already familiar with Python's basic syntax and know how to install third-party Python libraries. With that in place, read on.
1.2. The overall idea of crawling a web page
1.2.1. Basic concepts
Web page: the basic element of a website and the platform that carries web applications.
Example: Baidu home page
URL: the uniform and unique address of an information resource on the web. Example: the Baidu home page, www.baidu.com/
Web source code: The HTML file content of a web page.
Example: open the Baidu home page in Google Chrome, right-click, and choose "View page source".
Debug mode: Google Chrome's built-in developer tools.
Example: Baidu home page
PS: the developer tools themselves are not covered in detail here.
1.2.2. The simple approach
- Open a specific web page
- Write code to request the page and get its data back
- Parse out the data you want
In practice you will often find that crawling is not quite this smooth and many details have to be added, but the overall idea never leaves these three steps. Still, this is clearly not enough on its own, so a more detailed set of steps is needed.
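As a minimal sketch of these three steps (a toy example of my own, assuming only that the requests library is installed and using example.com as a stand-in page):

```python
import requests

# Steps 1 and 2: open the page in code and get its source back
resp = requests.get("https://example.com")

# Step 3: parse out the piece you want (here just a crude containment check)
if "Example Domain" in resp.text:
    print("found the data we were looking for")
```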
1.2.3. The detailed approach
- Open a specific web page.
- Look at the page source and search (Ctrl+F) for the data you want.
- If it is there (case 2.1 below): open developer mode, click Network, and refresh. You will find that the data you need is returned by the first URL. Finally, write code to crawl the page and parse it with XPath.
- If it is not there (case 2.2 below): open developer mode, click Network, and refresh. This time the data is not returned by the first URL as in the previous case; it is hidden in other requests loaded by JavaScript. A little front-end background helps here: we need to find out which request the data lives in, which is usually done by hand and by experience, filtering the XHR requests first. (Figure: filtering XHR requests and searching for the word "Baidu" on Baidu's home page.) Finally, write code to crawl that URL and parse the response as JSON (in most cases).
1.3. Third-party libraries to install
- requests: sends HTTP requests to web pages
- lxml: parses data out of HTML and other files
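Both can be installed with pip:

```
pip install requests lxml
```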
2. Code examples
2.1. The data is in the target URL
Demo1: crawl bilibili's popular ranking list
```python
from lxml import etree
import requests

# Note: in developer tools, this is the first URL in the Network panel
url = "https://www.bilibili.com/v/popular/rank/all"
# Imitate the browser's headers
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
}
# Send a GET request with the headers and get the response back
resp = requests.get(url, headers=headers)
# Parse the response text into an element tree
tree = etree.HTML(resp.text)
# A list to hold all rows of data
dli = []
# Walk through all 100 entries
for s in range(1, 101):
    li = []
    # Find each field by its path in the tree
    num = tree.xpath("/html/body/div[3]/div[2]/div[2]/ul/li[" + str(s) + "]/div[1]/text()")  # rank
    name = tree.xpath("/html/body/div[3]/div[2]/div[2]/ul/li[" + str(s) + "]/div[2]/div[2]/a/text()")  # title
    link = tree.xpath("/html/body/div[3]/div[2]/div[2]/ul/li[" + str(s) + "]/div[2]/div[2]/a/@href")  # link
    look = tree.xpath("/html/body/div[3]/div[2]/div[2]/ul/li[" + str(s) + "]/div[2]/div[2]/div[1]/span[1]/text()")  # play count
    say = tree.xpath("/html/body/div[3]/div[2]/div[2]/ul/li[" + str(s) + "]/div[2]/div[2]/div[1]/span[2]/text()")  # comment count
    up = tree.xpath("/html/body/div[3]/div[2]/div[2]/ul/li[" + str(s) + "]/div[2]/div[2]/div[1]/a/span/text()")  # uploader
    score = tree.xpath("/html/body/div[3]/div[2]/div[2]/ul/li[" + str(s) + "]/div[2]/div[2]/div[2]/div/text()")  # overall score
    # Collect the first match of each field into one row
    li.append(num[0])
    li.append(name[0])
    li.append(link[0])
    li.append(look[0])
    li.append(say[0])
    li.append(up[0])
    li.append(score[0])
    dli.append(li)
# Print the data
for dd in dli:
    print(dd)
```
The result is as follows (in this case the data still needs further cleaning, for which you can refer to Python's replace method for handling strings):
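A minimal sketch of that cleanup with replace, assuming the scraped strings still carry stray whitespace and newlines (the raw value below is made up for illustration):

```python
raw = "\n        3525000\n      "                 # a typical scraped play-count string
clean = raw.replace("\n", "").replace(" ", "")    # strip newlines and spaces
print(clean)                                      # 3525000
```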
2.2. The data is returned via other URLs
Demo2: crawl the home page information of a specified bilibili user
```python
# Import the requests library (install it first if the import fails)
import requests

# Note: this time the URL is the request that carries the data, not the first URL
url = "https://api.bilibili.com/x/space/arc/search"
# Imitate the browser's headers
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
}
# Required query parameters
params = {
    "mid": 387694560, "pn": 1, "ps": 25, "index": 1, "jsonp": "jsonp"
}
# Send a GET request with the headers and parameters and get the response back
resp = requests.get(url, headers=headers, params=params)
# Parse the response body as JSON
js = resp.json()
# Pull out the data set we need from the JSON hierarchy
infos = js['data']['list']['vlist']
# Walk through the data
bli = []
for info in infos:
    li = []
    author = info['author']
    bvid = info['bvid']
    pic = info['pic']
    title = info['title']
    li.append(author)
    li.append(bvid)
    li.append(pic)
    li.append(title)
    bli.append(li)
# Print the complete data
for ll in bli:
    print(ll)
```
The running results are as follows:
3. Code analysis
3.1. The data is in the target URL
The URL, headers, and XPath expressions in the code above are copied directly from the developer tools. To copy the full XPath of an element: select it with the arrow tool in developer tools, right-click the element in the Elements panel, and copy it from there. Note that the copied full XPath usually needs further adjustment:
- To get the text inside a tag, append `/text()` to the path.
- To get the value of one of the element's attributes, append `/@attribute-name` to the path.
- To get multiple copies of the same kind of data (e.g. ranks 1, 2, ...), concatenate strings into the XPath and loop over them, as in the sketch below.
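A small self-contained sketch of all three adjustments, using a made-up HTML fragment rather than the real bilibili page:

```python
from lxml import etree

html = """
<ul>
  <li><a href="/video/1">First</a></li>
  <li><a href="/video/2">Second</a></li>
</ul>
"""
tree = etree.HTML(html)

# /text() gets the text inside the tag
print(tree.xpath("//ul/li[1]/a/text()"))  # ['First']

# /@href gets the value of the href attribute
print(tree.xpath("//ul/li[1]/a/@href"))   # ['/video/1']

# Concatenating the index lets one path cover every numbered sibling
for i in range(1, 3):
    print(tree.xpath("//ul/li[" + str(i) + "]/a/text()"))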
3.2. The data is returned via other URLs
Here the URL is no longer the first one in the Network panel, but the request we found that actually carries the data. The rest is basically the same; the difference is in extracting the data: instead of XPath, we parse the JSON response and walk its hierarchy level by level, as the code above shows.
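A minimal sketch of walking a JSON response, using httpbin.org (a public echo service) as a stand-in for the bilibili API:

```python
import requests

resp = requests.get("https://httpbin.org/json")
js = resp.json()  # parse the JSON body into nested Python dicts and lists

# Walk the hierarchy level by level, just like js['data']['list']['vlist'] above
print(js["slideshow"]["title"])
for slide in js["slideshow"]["slides"]:
    print(slide["title"])
```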
3.3. Summary
- The two approaches above are only the most common ones encountered when writing Python crawlers; they do not cover every case.
- They apply in the general case, but real-world work often also involves handling cookies, anti-crawler measures, multi-threading, multi-processing, and so on. Since this article is aimed at beginners, those topics are not covered here.
- Most of the requests we see on the web are GET and POST, so pay attention to which method the page uses when you crawl it: check whether a GET request needs a params argument and whether a POST request needs form data, as in the sketch below.
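A minimal sketch of that difference, again using httpbin.org as a stand-in service:

```python
import requests

# GET: query-string values go in the params argument
resp = requests.get("https://httpbin.org/get", params={"q": "python"})
print(resp.json()["args"])   # {'q': 'python'}

# POST: form fields go in the data argument
resp = requests.post("https://httpbin.org/post", data={"q": "python"})
print(resp.json()["form"])   # {'q': 'python'}
```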
4. The underlying principle
In fact, a lot of the data in a web page is not directly in the HTML; some of it is rendered later from JSON returned by JavaScript. Depending on which case we face, there are two solutions: when the data is in the HTML, we parse it out with XPath; when the data comes back as a JSON response, we take it straight from the JSON hierarchy.
5. Project address
Github.com/zhizhangxue…