Web crawler page parsing
preface
With the rapid development of the Internet, more and more information floods every major network platform. Much like the law of large numbers that L. Lawliet cites in Death Note, complicated data hides regularities, and chance events contain their share of inevitability. Whether we are talking about the law of large numbers or the recently popular field of big data, none of it can do without large amounts of clean data for support. A web crawler meets exactly this need: it lets us fetch whatever information we want from the Internet, analyze it, and then make sound decisions. Likewise, crawlers are everywhere in our everyday browsing; the widely known "Baidu", for example, is a large and well-deserved "spider king". To write a powerful crawler, you need to be proficient at web page parsing. This article shows you how to use the major web page parsing tools in Python.
The common parsing approaches are regular expressions, Beautiful Soup, XPath, and pyquery. This article mainly covers the latter three tools; regular expressions are not explained here, and readers who are interested can look up regular expressions separately.
- Use of Beautiful Soup
- The use of XPath
- The use of pyquery
- Beautiful Soup, XPath, and pyquery in practice (Tencent recruitment)
Use of Beautiful Soup
Beautiful Soup is one of the Python crawler parsing tools for HTML and XML that can easily extract the data you want from a page. In addition, Beautiful Soup provides us with the following four parsers:
- Python standard library parser:
soup = BeautifulSoup(content, "html.parser")
- lxml HTML parser:
soup = BeautifulSoup(content, "lxml")
- lxml XML parser:
soup = BeautifulSoup(content, "xml")
- html5lib parser:
soup = BeautifulSoup(content, "html5lib")
Among the four parsers above, lxml has the merits of fast parsing speed and strong error tolerance. Considering overall performance, this article mainly uses the lxml parser. Next, we take the HTML of the Baidu home page as the example to explain how Beautiful Soup is used. Let's start with this little crawler:
from bs4 import BeautifulSoup
import requests
if __name__ == "__main__":
    response = requests.get("https://www.baidu.com")
    encoding = response.apparent_encoding
    response.encoding = encoding
    print(BeautifulSoup(response.text, "lxml"))
Code interpretation:
First, we use the requests library in Python to send a GET request to Baidu, then obtain the page's encoding and set the response encoding accordingly to avoid garbled characters. Finally, BeautifulSoup converts the response into a parsable object using the lxml parser.
- `response = requests.get("https://www.baidu.com")`: requests the Baidu link
- `encoding = response.apparent_encoding`: obtains the page's encoding format
- `response.encoding = encoding`: sets the request encoding to the page's own encoding to avoid garbled characters
- `print(BeautifulSoup(response.text, "lxml"))`: parses the Baidu home page HTML with the lxml parser and prints the result
The printed result is as follows:
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta content="text/html; charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>Google it and you'll see</title></head> <body link="#0000cc"> <div id="wrapper"> <div id="head"> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/>
... (several lines of output omitted)
As we can see, the printed content is messy. To make the parsed page clearer and easier to read, we can use the prettify() method to apply standard indentation. To keep the explanation manageable, Taoye has trimmed the result appropriately, leaving only the valuable content. The source and output are as follows:
bd_soup = BeautifulSoup(response.text, "lxml")
print(bd_soup.prettify())
<html>
<head>
<title>Google it and you'll see</title>
</head>
<body link="#0000cc">
<div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div class="s_form">
<div class="s_form_wrapper">
<div id="lg">
<img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/>
</div>
</div>
</div>
... (several lines of output omitted)
</div>
</div>
</body>
</html>
Node selection
In Beautiful Soup, we can easily select the desired node through the bd_soup object. Its built-in API is rich enough for practical use. The basic rules are as follows:
bd_title_bj = bd_soup.title                           # .title gets the title node in the HTML
bd_title_bj_name = bd_soup.title.name                 # .name gets the name of the corresponding node
bd_title_name = bd_soup.title.string                  # .string gets the text content of the node
bd_title_parent_bj_name = bd_soup.title.parent.name   # .parent gets the parent node
bd_image_bj = bd_soup.img                             # .img gets the img node
bd_image_bj_dic = bd_soup.img.attrs                   # .attrs gets the node's attribute values
bd_image_all = bd_soup.find_all("img")                # find_all finds all the specified nodes
bd_image_idlg = bd_soup.find("div", id="lg")          # find the node by its id attribute
The printed results of the code above are as follows; try to match each line of output to its meaning:
<title>Google it and you'll see</title>
title
Google it and you'll see
head
<img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/>
{'hidefocus': 'true', 'src': '//www.baidu.com/img/bd_logo1.png', 'width': '270', 'height': '129'}
[<img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/>, <img src="//www.baidu.com/img/gs.gif"/>]
<div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> </div>
Code interpretation:
- `bd_soup.title`: as mentioned earlier, Beautiful Soup can parse the corresponding page with ease; simply writing `bd_soup.title` selects the `title` node of the Baidu home page HTML.
- `bd_soup.title.name`: `.name` returns the name of the node.
- `bd_soup.title.string`: `.string` returns the text content of the node; here it is the content of the Baidu home page's title node, i.e. the text shown in the browser tab, "Google it and you'll see".
- `bd_soup.title.parent.name`: `.parent` selects the node's parent, that is, the node one level above it, and `.name` then returns the parent's name.
- `bd_soup.img`: just like `bd_soup.title`, this selects the `img` node. Note that the HTML above contains two `img` nodes; `.img` returns only the first `img` node in the HTML by default, not all of them.
- `bd_soup.img.attrs`: returns the attributes of the `img` node as a dictionary of key-value pairs, so individual attributes can be read with ordinary dictionary operations, e.g. `bd_soup.img.attrs.get("src")` or `bd_soup.img.attrs["src"]` to obtain the `src` attribute of the `img` node, i.e. the image link (see the short sketch after this list).
- `bd_soup.find_all("img")`: the `.img` operation above only returns the first `img` node by default; to get all `img` nodes, use `.find_all("img")`, which returns a list of all matching nodes.
- `bd_soup.find("div", id="lg")`: in practice we often need to select a specific node, which `.find()` supports by letting us pass in the attributes to match. Note that when matching on `class`, the keyword must be written as `class_`, e.g. `.find("div", class_="xxx")`. This line therefore selects the `div` node whose `id` attribute is `lg`. The `.find_all()` method above accepts the same attribute filters and returns all matching nodes.
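To tie these pieces together, here is a minimal sketch (assuming the `bd_soup` object built earlier in this section) that combines `find_all()` with `attrs.get()` to pull the `src` attribute out of every `img` node:

for img in bd_soup.find_all("img"):
    # attrs is a plain dictionary, so .get() safely returns None for a missing key
    print(img.attrs.get("src"))
# Expected output (from the Baidu home page HTML shown above):
# //www.baidu.com/img/bd_logo1.png
# //www.baidu.com/img/gs.gif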
Data extraction
In the previous section on node selection, we already covered some of the methods used to extract data, but Beautiful Soup's power doesn't stop there. Let's continue to unravel it. (Note: the following are just some commonly used APIs; if you have higher requirements, please check the official documentation.)
- .get_text()
Gets all the text content of the object (that is, the text we can see on the page):
all_content = bd_soup.get_text()
Google it and you'll see News hao123 Map Video Tieba Login document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=' + encodeURIComponent(window.location.href + (window.location.search === "" ? "?" : "&") + "bdorz_come=1") + '" name="tj_login" class="lb">Login</a>'); ... About Baidu ©2017 Baidu Must read before using Baidu Feedback Beijing ICP Certificate No. 030173
- .strings, .stripped_strings
print(type(bd_soup.strings))
# <class 'generator'>
.strings extracts all the text content of the bd_soup object. As the output above shows, the type of .strings is a generator, so we can use a loop to pull the content out of it. However, when we use .strings we will find that the extracted content contains many spaces and line breaks; to solve this we can use the .stripped_strings method instead, as follows:
for each in bd_soup.stripped_strings:
    print(each)
Output result:
Baidu ©2017 Baidu Must read before using Baidu Feedback Beijing ICP Certificate No. 030173
- .parent, .children, .parents
.parent selects the parent node of the node, .children selects the child nodes of the node, and .parents selects all the ancestor nodes of the node and returns a generator.
bd_div_bj = bd_soup.find("div", id="u1")
print(type(bd_div_bj.parent))
print("*" * 50)
for child in bd_div_bj.children:
    print(child)
print("*" * 50)
for parent in bd_div_bj.parents:
    print(parent.name)
Result output:
<class 'bs4.element.Tag'>
**************************************************
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">Map</a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">Video</a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">Tieba</a>
**************************************************
div
div
div
body
html
Beautiful Soup summary
The main usage of Beautiful Soup is as described above; other operations are rarely needed in actual development, so Beautiful Soup is relatively easy to use. Other operations can be found in the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#contents-children
XPath parses pages
XPath, which stands for XML Path Language, can be used to parse both XML and HTML. Having covered some of the common operations of Beautiful Soup in the last section, let's continue with XPath and see why it is so powerful and so popular with developers. Like Beautiful Soup, XPath provides a very concise way to select nodes, but where Beautiful Soup mostly uses the `.` form to select child or descendant nodes, XPath mainly selects nodes through `/`. In addition, XPath has a number of built-in functions for handling matching conditions on the data.
First, let’s look at some common node matching rules in XPath:
| Expression | Explanation |
| --- | --- |
| `/` | Selects a direct child node of the current node |
| `//` | Selects descendant nodes of the current node |
| `.` | Selects the current node |
| `..` | Selects the parent of the current node |
| `@` | Selects attributes (id, class, ...) |
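As a quick illustration of these rules before we move on to the Baidu page, here is a minimal, self-contained sketch (the tiny HTML string is made up purely for demonstration) showing `/`, `//`, `..` and `@` in action with lxml:

from lxml import etree

# a made-up snippet just to demonstrate the matching rules
demo = etree.HTML("<div id='box'><p><a href='/a'>one</a></p><a href='/b'>two</a></div>")
print(demo.xpath("//div/a/@href"))   # direct child a of div        -> ['/b']
print(demo.xpath("//div//a/@href"))  # all descendant a of div      -> ['/a', '/b']
print(demo.xpath("//p/../@id"))      # .. goes up to p's parent div -> ['box']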
Let's continue to use the HTML of Baidu's home page to explain the use of XPath.
Node selection
In order to use XPath properly, we first need to import the corresponding module correctly; here we usually use lxml. An example:
from lxml import etree
import requests
import html
if __name__ == "__main__":
    response = requests.get("https://www.baidu.com")
    encoding = response.apparent_encoding
    response.encoding = encoding
    print(response.text)

    bd_bj = etree.HTML(response.text)
    bd_html = etree.tostring(bd_bj).decode("utf-8")
    print(html.unescape(bd_html))
Lines 1 to 9 are the same as in the Beautiful Soup example; the remaining code is explained below:
- `etree.HTML(response.text)`: uses the `HTML` class in the `etree` module to initialize the Baidu HTML and construct an XPath parsing object of type `<Element html at 0x1aba86b1c08>`
- `etree.tostring(bd_bj).decode("utf-8")`: converts the object above into a string and decodes it as utf-8
- `html.unescape(bd_html)`: converts the HTML entities in `bd_html` into the corresponding Unicode characters, following the rules defined by the HTML5 standard
The printed result is the same as with Beautiful Soup, so it is not shown again here; readers who skipped it can scroll back. Now that we have an XPath-parsable object (bd_bj), we need to select nodes on that object. As mentioned above, XPath mainly extracts nodes through `/`. Here are some common node selection operations in XPath:
all_bj = bd_bj.xpath("//*")           # select all nodes
img_bj = bd_bj.xpath("//img")         # select nodes with the specified name
p_a_zj_bj = bd_bj.xpath("//p/a")      # select direct child nodes
p_a_all_bj = bd_bj.xpath("//p//a")    # select all descendant nodes
head_bj = bd_bj.xpath("//title/..")   # select the parent node
The results are as follows:
[<Element html at 0x14d6a6d1c88>, <Element head at 0x14d6a6e4408>, <Element meta at 0x14d6a6e4448>, <Element meta at 0x14d6a6e4488>, <Element meta at 0x14d6a6e44c8>, <Element link at 0x14d6a6e4548>, <Element title at 0x14d6a6e4588>, <Element body at 0x14d6a6e45c8>, <Element div at 0x14d6a6e4608>, <Element div at 0x14d6a6e4508>, <Element div at 0x14d6a6e4648>, <Element div at 0x14d6a6e4688>, ......] [<Element img at 0x14d6a6e4748>, <Element img at 0x14d6a6e4ec8>] [<Element a at 0x14d6a6e4d88>, <Element a at 0x14d6a6e4dc8>, <Element a at 0x14d6a6e4e48>, <Element a at 0x14d6a6e4e88>] [<Element a at 0x14d6a6e4d88>, <Element a at 0x14d6a6e4dc8>, <Element a at 0x14d6a6e4e48>, <Element a at 0x14d6a6e4e88>] [<Element head at 0x14d6a6e4408>]
- `all_bj = bd_bj.xpath("//*")`: `//*` selects all descendant nodes under the current node (`html`) and returns them as a list; each list element is an `Element` object just like `bd_bj`, and the same return type applies to the operations below
- `img_bj = bd_bj.xpath("//img")`: selects the nodes with the specified name under the current node; comparing this with Beautiful Soup's `.find_all("img")` may help memorization
- `p_a_zj_bj = bd_bj.xpath("//p/a")`: selects every `a` node that is a **direct** child of a `p` node under the current node; note the word "direct": if an `a` node is not a direct child of a `p` node, it is not selected
- `p_a_all_bj = bd_bj.xpath("//p//a")`: selects **all** descendant `a` nodes of every `p` node under the current node; note how "all" distinguishes this from the previous operation
- `head_bj = bd_bj.xpath("//title/..")`: selects the parent of the `title` node under the current node, i.e. the `head` node
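If you want to inspect the markup behind one of these Element objects, you can serialize it the same way the parse object itself was serialized above. A minimal sketch, assuming the `bd_bj` object built earlier:

first_img = bd_bj.xpath("//img")[0]               # take the first Element from the returned list
print(etree.tostring(first_img).decode("utf-8"))  # e.g. <img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" ...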
Data extraction
Now that we know how to select the specified node, we need to extract the data contained in the node, as shown in the following example:
img_href_ls = bd_bj.xpath("//img/@src")
img_href = bd_bj.xpath("//div[@id='lg']/img[@hidefocus='true']/@src")
a_content_ls = bd_bj.xpath("//a//text()")
a_news_content = bd_bj.xpath("//a[@class='mnav' and @name='tj_trnews']/text()")
Output result:
['//www.baidu.com/img/bd_logo1.png', '//www.baidu.com/img/gs.gif']
['//www.baidu.com/img/bd_logo1.png']
['News', 'hao123', 'Map', 'Video', 'Tieba', 'Login', 'More products', 'About Baidu', 'About Baidu', 'Must read before using Baidu', 'Feedback']
['News']
- `img_href_ls = bd_bj.xpath("//img/@src")`: first selects all `img` nodes under the current node, then selects the `src` attribute value of every one of them, returning a list
- `img_href = bd_bj.xpath("//div[@id='lg']/img[@hidefocus='true']/@src")`: first selects the `div` node under the current node whose `id` attribute is `lg`, then selects its direct child `img` node whose `hidefocus` attribute is `true`, and finally selects that node's `src` attribute value
- `a_content_ls = bd_bj.xpath("//a//text()")`: selects the text content found under all `a` nodes of the current node
- `a_news_content = bd_bj.xpath("//a[@class='mnav' and @name='tj_trnews']/text()")`: multi-attribute selection; in XPath, nodes that must satisfy several attributes at once are specified with `and` (a further example using one of XPath's built-in functions follows this list)
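Earlier we noted that XPath also has a number of built-in functions for handling matching conditions. As one simple example (again assuming the `bd_bj` object built above), `contains()` matches an attribute by substring rather than by exact value:

# select the text of every a node whose class attribute contains "mnav"
mnav_texts = bd_bj.xpath("//a[contains(@class, 'mnav')]/text()")
print(mnav_texts)   # e.g. ['News', 'hao123', 'Map', 'Video', 'Tieba']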
Tips: readers should take care to match the code to the output as they read; the meanings are easy to remember once understood and practiced a little.
XPath summary
After reading through the usage of XPath, the attentive reader will find that Beautiful Soup and XPath are essentially the same in idea. As long as you keep the matching rules above in mind, you have already grasped the basic usage of XPath.
Get started with PyQuery
The official explanation for PyQuery is as follows:
pyquery allows you to make jquery queries on xml documents. The API is as much as possible the similar to jquery. pyquery uses lxml for fast xml and html manipulation.
This is not (or at least not yet) a library to produce or interact with javascript code. I just liked the jquery API and I missed it in Python so I told myself “Hey let’s make jquery in Python” This is the result. It can be used for many purposes, one idea that I might try in the future is to use it for templating with pure http templates that you modify using pyquery. I can also be used for web scrapping or for theming applications with Deliverance. The project is being actively developped on a git repository on Github. I have the policy of giving push access to anyone who wants it and then to review what he does. So if you want to contribute just email me. Please report bugs on the github issue tracker.
In addition to Beautiful Soup and XPath, pyquery is another option for parsing web pages, and it too is popular with many crawler developers. Let's take a look at how pyquery is used.
Node selection
Compared with Beautiful Soup and XPath, pyquery can build its parse object from an HTML string, from a URL, or from the path of a local HTML file. Readers can choose whichever fits their habits in actual use. Let's look at these three ways:
import requests
from pyquery import PyQuery as pq

bd_html = requests.get("https://www.baidu.com").text
bd_url = "https://www.baidu.com"
bd_path = "./bd.html"

def way1(html):
    return pq(html)

def way2(url):
    return pq(url=url)

def way3(path):
    return pq(filename=path)

print(type(way1(html=bd_html)))
print(type(way2(url=bd_url)))
print(type(way3(path=bd_path)))
# <class 'pyquery.pyquery.PyQuery'>
# <class 'pyquery.pyquery.PyQuery'>
# <class 'pyquery.pyquery.PyQuery'>
In pyquery, CSS selectors are used on this object to extract the information we want from the page. We continue to use the Baidu home page to explain the use of pyquery, and we assume below that the parse object is bd_bj.
response = requests.get("https://www.baidu.com")
response.encoding = "utf-8"
bd_bj = pq(response.text)

bd_title = bd_bj("title")          # select by node name
bd_img_ls = bd_bj("img")
bd_img_ls2 = bd_bj.find("img")
bd_mnav = bd_bj(".mnav")           # select by class attribute
bd_img = bd_bj("#u1 a")            # select by id attribute
bd_a_video = bd_bj("#u1 .mnav")    # mixed selection

# <title>Google it and you'll see</title>
# <img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270" height="129"/> <img src="//www.baidu.com/img/gs.gif"/>
# ... (the remaining output is longer; readers can run it themselves)
As the code above shows, pyquery usually extracts nodes in one of three ways. The first is to pass the node name directly (the find method does the same thing); since this places no restriction on attributes, it often matches several nodes, which we can then iterate over with .items(), as in the sketch below. The second is to select nodes carrying a particular class attribute, written as `.` followed by the class value. The third is to select nodes carrying a particular id attribute, written as `#` followed by the id value. Readers familiar with CSS will find these easy to understand: they are exactly the selectors CSS uses to locate nodes before styling them. The three forms can also be mixed, as in the extraction of bd_a_video above.
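As a small sketch of the .items() loop just mentioned (assuming the bd_bj object built in the previous code block), this walks over every node carrying the mnav class and prints its text:

for a in bd_bj(".mnav").items():   # .items() yields each matched node as a PyQuery object
    print(a.text())                # e.g. News, hao123, Map, Video, Tieba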
Data extraction
In actual web page parsing, the three tools follow essentially the same approach. For readers who want to understand pyquery's data extraction, and for the blogger's own future reference, here is a brief introduction:
img_src1 = bd_bj("img").attr("src")   # //www.baidu.com/img/bd_logo1.png
img_src2 = bd_bj("img").attr.src      # //www.baidu.com/img/bd_logo1.png

for each in bd_bj.find("img").items():
    print(each.attr("src"))

print(bd_bj("title").text())          # Google it and you'll see
Lines 1 and 2 show the two ways of extracting a node attribute: .attr("src") and .attr.src. As noted in the node selection section, selecting by node name without restricting attributes matches every qualifying node, so these two lines return only the attribute of the first matched node. To extract the attribute of every matching node, combine .items() with a loop as in lines 4 and 5, then read the attribute inside the loop just as before. As line 7 shows, pyquery extracts a node's text with .text().
Pyquery summary
pyquery parsing follows the same idea as Beautiful Soup and XPath, so only a brief introduction is given here; readers who want to learn more can consult the official documentation and practice until proficient.
Tencent recruitment site parsing in practice
Now that we have covered Beautiful Soup, XPath, and pyquery, let's reinforce the three parsing methods with a simple case study. The website analyzed here is the Tencent recruitment site, at hr.tencent.com/; the home page of its social recruitment section looks as follows:
Our task this time is to use each of the three parsing tools above to collect all of the job data under the social recruitment section of the site.
Web page Analysis:
From the social recruitment home page of this site, we can identify three main pieces of information:
- The home page URL: hr.tencent.com/position.ph…
- There are 288 pages of data, 10 jobs per page, for a total of 2,871 jobs
- There are five data fields, namely: job name, job category, number of recruits, work location, and job Posting time
Since we want to parse all the job data on the site, and staying on the first page reveals nothing else of value, let's move on to the second page. We can then see an obvious change in the site's URL: a `start` parameter is now submitted by the client, and the link becomes hr.tencent.com/position.ph… After opening the following pages, the rule is not hard to spot: the `start` parameter submitted by each page increases by a step of 10. Inspecting the page with the Google developer tools then shows that the whole site is made of static pages, which saves us a lot of parsing trouble; the data we need sits statically inside a `table` tag, as follows:
Below, we use the three tools above to parse all the job data on this site.
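Before the full program, here is a minimal sketch of the pagination rule just described: the start parameter grows by 10 per page (287 pages are iterated here, matching the loop in the source code below):

base_url = "https://hr.tencent.com/position.php?&start={}#a"
page_urls = [base_url.format(index * 10) for index in range(287)]
print(page_urls[0])   # https://hr.tencent.com/position.php?&start=0#a
print(page_urls[1])   # https://hr.tencent.com/position.php?&start=10#a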
Example source code
import requests
from bs4 import BeautifulSoup
from lxml import etree
from pyquery import PyQuery as pq
from requests.exceptions import RequestException
import itertools
import pandas as pd


class TencentPosition():

    """
    Function: define the initial variables
    Parameter: start: starting offset of the page data
    """
    def __init__(self, start):
        self.url = "https://hr.tencent.com/position.php?&start={}#a".format(start)
        self.headers = {
            "Host": "hr.tencent.com",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
        }
        self.file_path = "./TencentPosition.csv"

    """
    Function: request the target page
    Parameters: url: target link; headers: request headers
    Return: html, the page source code
    """
    def get_page(self, url, headers):
        try:
            res = requests.get(url, headers=headers)
            if res.status_code == 200:
                return res.text
            else:
                return self.get_page(url, headers=headers)
        except RequestException:
            return self.get_page(url, headers=headers)

    """
    Function: parse the page with Beautiful Soup
    Parameter: html: the requested page source code
    """
    def soup_analysis(self, html):
        soup = BeautifulSoup(html, "lxml")
        tr_list = soup.find("table", class_="tablelist").find_all("tr")
        for tr in tr_list[1:-1]:
            position_info = [td_data for td_data in tr.stripped_strings]
            self.settle_data(position_info=position_info)

    """
    Function: parse the page with XPath
    Parameter: html: the requested page source code
    """
    def xpath_analysis(self, html):
        result = etree.HTML(html)
        tr_list = result.xpath("//table[@class='tablelist']//tr")
        for tr in tr_list[1:-1]:
            position_info = tr.xpath("./td//text()")
            self.settle_data(position_info=position_info)

    """
    Function: parse the page with pyquery
    Parameter: html: the requested page source code
    """
    def pyquery_analysis(self, html):
        result = pq(html)
        tr_list = result.find(".tablelist").find("tr")
        for tr in itertools.islice(tr_list.items(), 1, 11):
            position_info = [td.text() for td in tr.find("td").items()]
            self.settle_data(position_info=position_info)

    """
    Function: assemble the position data
    Parameter: position_info: list of field values
    """
    def settle_data(self, position_info):
        position_data = {
            "Job Title": position_info[0].replace("\xa0", ""),  # replace removes \xa0 characters to prevent transcoding errors
            "Job Category": position_info[1],
            "Number of recruits": position_info[2],
            "Place of Work": position_info[3],
            "Release time": position_info[-1],
        }
        print(position_data)
        self.save_data(self.file_path, position_data)

    """
    Function: save the data
    Parameters: file_path: file save path; position_data: the position data
    """
    def save_data(self, file_path, position_data):
        df = pd.DataFrame([position_data])
        try:
            df.to_csv(file_path, header=False, index=False, mode="a+", encoding="gbk")  # transcode and append the data line by line
        except:
            pass


if __name__ == "__main__":
    for page, index in enumerate(range(287)):
        print("Crawling job data on page {}:".format(page + 1))
        tp = TencentPosition(start=(index * 10))
        tp_html = tp.get_page(url=tp.url, headers=tp.headers)
        tp.pyquery_analysis(html=tp_html)
        print("\n")
Here are some of the results:
Conclusion
In this article we first introduced the common operations of Beautiful Soup, XPath, and pyquery, and then used the three parsing tools to crawl all the job data on the Tencent recruitment site so that readers could get a deeper feel for each of them. Because the focus of this article is web page parsing, multithreading and multiprocessing were not used, and crawling all the data takes a minute or two; a later article will introduce multithreading and multiprocessing together with another case. The parsing methods themselves have been covered above, so readers are encouraged to read the source code; if you have any questions about crawler page parsing, please contact Taoye or leave a comment below.
Note: this article covers the operations commonly used in actual development, not all of them; to improve further, readers should be sure to read the official documentation.