Web crawler page parsing
preface
With the rapid development of the Internet, more and more information floods every major network platform. Much like the law of large numbers that L. Lawliet cites in Death Note, complicated data hides regularities, and chance events contain their share of inevitability. Whether we are talking about the law of large numbers or the recently popular field of big data, none of it can do without large amounts of clean data for support. A web crawler meets exactly this need: it lets us fetch whatever information we want from the Internet, analyze it, and then make sound decisions. Likewise, crawlers are everywhere in our everyday browsing; the widely known "Baidu", for example, is a large and well-deserved "spider king". To write a powerful crawler, you need to be proficient at web page parsing. This article shows you how to use the major web page parsing tools in Python.
The common parsing approaches are regular expressions, Beautiful Soup, XPath, and pyquery. This article mainly covers the latter three tools; regular expressions are not explained here, and readers who are interested can look up regular expressions separately.
- Use of Beautiful Soup
- The use of XPath
- The use of pyquery
- Beautiful Soup, XPath, and pyquery in practice (Tencent recruitment)
Use of Beautiful Soup
Beautiful Soup is one of the Python crawler parsing tools for HTML and XML that can easily extract the data you want from a page. In addition, Beautiful Soup provides us with the following four parsers:
- Python standard library parser:
soup = BeautifulSoup(content, "html.parser")
- lxml HTML parser:
soup = BeautifulSoup(content, "lxml")
- lxml XML parser:
soup = BeautifulSoup(content, "xml")
- html5lib parser:
soup = BeautifulSoup(content, "html5lib")
Among the four parsers above, lxml has the merits of fast parsing speed and strong error tolerance. Considering overall performance, this article mainly uses the lxml parser. Next, we take the HTML of the Baidu home page as the example to explain how Beautiful Soup is used. Let's start with this little crawler:
from bs4 import BeautifulSoup
import requests
if __name__ == "__main__":
    response = requests.get("https://www.baidu.com")
    encoding = response.apparent_encoding
    response.encoding = encoding
    print(BeautifulSoup(response.text, "lxml"))
Code interpretation:
First, we use the requests library in Python to send a GET request to Baidu, then obtain the page's encoding and set the response encoding accordingly to avoid garbled characters. Finally, BeautifulSoup converts the response into a parsable object using the lxml parser.
- `response = requests.get("https://www.baidu.com")`: requests the Baidu link
- `encoding = response.apparent_encoding`: obtains the page's encoding format
- `response.encoding = encoding`: sets the request encoding to the page's own encoding to avoid garbled characters
- `print(BeautifulSoup(response.text, "lxml"))`: parses the Baidu home page HTML with the lxml parser and prints the result
The printed result is as follows:
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta content="text/html; charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>Google it and you'll see</title></head> <body link="#0000cc"> <div id="wrapper"> <div id="head"> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/>
... (several lines of output omitted)
As we can see, the printed content is messy. To make the parsed page clearer and easier to read, we can use the prettify() method to apply standard indentation. To keep the explanation manageable, Taoye has trimmed the result appropriately, leaving only the valuable content. The source and output are as follows:
bd_soup = BeautifulSoup(response.text, "lxml")
print(bd_soup.prettify())
<html>
<head>
<title>Google it and you'll see</title>
</head>
<body link="#0000cc">
<div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div class="s_form">
<div class="s_form_wrapper">
<div id="lg">
<img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/>
</div>
</div>
</div>
... (several lines of output omitted)
</div>
</div>
</body>
</html>
Node selection
In Beautiful Soup, we can easily select the desired node through the bd_soup object. Its built-in API is rich enough for practical use. The basic rules are as follows:
bd_title_bj = bd_soup.title                           # .title gets the title node in the HTML
bd_title_bj_name = bd_soup.title.name                 # .name gets the name of the corresponding node
bd_title_name = bd_soup.title.string                  # .string gets the text content of the node
bd_title_parent_bj_name = bd_soup.title.parent.name   # .parent gets the parent node
bd_image_bj = bd_soup.img                             # .img gets the img node
bd_image_bj_dic = bd_soup.img.attrs                   # .attrs gets the node's attribute values
bd_image_all = bd_soup.find_all("img")                # find_all finds all the specified nodes
bd_image_idlg = bd_soup.find("div", id="lg")          # find the node by its id attribute
The printed results of the code above are as follows; try to match each line of output to its meaning:
<title>Google it and you'll see</title>
title
Google it and you'll see
head
<img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/>
{'hidefocus': 'true', 'src': '//www.baidu.com/img/bd_logo1.png', 'width': '270', 'height': '129'}
[<img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/>, <img src="//www.baidu.com/img/gs.gif"/>]
<div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> </div>
Code interpretation:
- `bd_soup.title`: as mentioned earlier, Beautiful Soup can parse the corresponding page with ease; simply writing `bd_soup.title` selects the `title` node of the Baidu home page HTML.
- `bd_soup.title.name`: `.name` returns the name of the node.
- `bd_soup.title.string`: `.string` returns the text content of the node; here it is the content of the Baidu home page's title node, i.e. the text shown in the browser tab, "Google it and you'll see".
- `bd_soup.title.parent.name`: `.parent` selects the node's parent, that is, the node one level above it, and `.name` then returns the parent's name.
- `bd_soup.img`: just like `bd_soup.title`, this selects the `img` node. Note that the HTML above contains two `img` nodes; `.img` returns only the first `img` node in the HTML by default, not all of them.
- `bd_soup.img.attrs`: returns the attributes of the `img` node as a dictionary of key-value pairs, so individual attributes can be read with ordinary dictionary operations, e.g. `bd_soup.img.attrs.get("src")` or `bd_soup.img.attrs["src"]` to obtain the `src` attribute of the `img` node, i.e. the image link (see the short sketch after this list).
- `bd_soup.find_all("img")`: the `.img` operation above only returns the first `img` node by default; to get all `img` nodes, use `.find_all("img")`, which returns a list of all matching nodes.
- `bd_soup.find("div", id="lg")`: in practice we often need to select a specific node, which `.find()` supports by letting us pass in the attributes to match. Note that when matching on `class`, the keyword must be written as `class_`, e.g. `.find("div", class_="xxx")`. This line therefore selects the `div` node whose `id` attribute is `lg`. The `.find_all()` method above accepts the same attribute filters and returns all matching nodes.
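To tie these pieces together, here is a minimal sketch (assuming the `bd_soup` object built earlier in this section) that combines `find_all()` with `attrs.get()` to pull the `src` attribute out of every `img` node:

for img in bd_soup.find_all("img"):
    # attrs is a plain dictionary, so .get() safely returns None for a missing key
    print(img.attrs.get("src"))
# Expected output (from the Baidu home page HTML shown above):
# //www.baidu.com/img/bd_logo1.png
# //www.baidu.com/img/gs.gif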
Data extraction
In the previous section on node selection, we already covered some of the methods used to extract data, but Beautiful Soup's power doesn't stop there. Let's continue to unravel it. (Note: the following are just some commonly used APIs; if you have higher requirements, please check the official documentation.)
- .get_text()
Gets all the text content of the object (that is, the text we can see on the page):
all_content = bd_soup.get_text()
Google it and you'll see News hao123 Map Video Tieba Login document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=' + encodeURIComponent(window.location.href + (window.location.search === "" ? "?" : "&") + "bdorz_come=1") + '" name="tj_login" class="lb">Login</a>'); ... About Baidu ©2017 Baidu Must read before using Baidu Feedback Beijing ICP Certificate No. 030173
- .strings, .stripped_strings
print(type(bd_soup.strings))
# <class 'generator'>
.strings extracts all the text content of the bd_soup object. As the output above shows, the type of .strings is a generator, so we can use a loop to pull the content out of it. However, when we use .strings we will find that the extracted content contains many spaces and line breaks; to solve this we can use the .stripped_strings method instead, as follows:
for each in bd_soup.stripped_strings:
    print(each)
Output result:
Baidu ©2017 Baidu Must read before using Baidu Feedback Beijing ICP Certificate No. 030173
- .parent, .children, .parents
.parent selects the parent node of the node, .children selects the child nodes of the node, and .parents selects all the ancestor nodes of the node and returns a generator.
bd_div_bj = bd_soup.find("div", id="u1")
print(type(bd_div_bj.parent))
print("*" * 50)
for child in bd_div_bj.children:
    print(child)
print("*" * 50)
for parent in bd_div_bj.parents:
    print(parent.name)
Result output:
<class 'bs4.element.Tag'>
**************************************************
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">Map</a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">Video</a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">Tieba</a>
**************************************************
div
div
div
body
html
Beautiful Soup summary
The main usage of Beautiful Soup is as described above; other operations are rarely needed in actual development, so Beautiful Soup is relatively easy to use. Other operations can be found in the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#contents-children
XPath parses pages
XPath, which stands for XML Path Language, can be used to parse both XML and HTML. Having covered some of the common operations of Beautiful Soup in the last section, let's continue with XPath and see why it is so powerful and so popular with developers. Like Beautiful Soup, XPath provides a very concise way to select nodes, but where Beautiful Soup mostly uses the `.` form to select child or descendant nodes, XPath mainly selects nodes through `/`. In addition, XPath has a number of built-in functions for handling matching conditions on the data.
First, let’s look at some common node matching rules in XPath:
| Expression | Explanation |
| --- | --- |
| `/` | Selects a direct child node of the current node |
| `//` | Selects descendant nodes of the current node |
| `.` | Selects the current node |
| `..` | Selects the parent of the current node |
| `@` | Selects attributes (id, class, ...) |
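As a quick illustration of these rules before we move on to the Baidu page, here is a minimal, self-contained sketch (the tiny HTML string is made up purely for demonstration) showing `/`, `//`, `..` and `@` in action with lxml:

from lxml import etree

# a made-up snippet just to demonstrate the matching rules
demo = etree.HTML("<div id='box'><p><a href='/a'>one</a></p><a href='/b'>two</a></div>")
print(demo.xpath("//div/a/@href"))   # direct child a of div        -> ['/b']
print(demo.xpath("//div//a/@href"))  # all descendant a of div      -> ['/a', '/b']
print(demo.xpath("//p/../@id"))      # .. goes up to p's parent div -> ['box']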
Let's continue to use the HTML of Baidu's home page to explain the use of XPath.
Node selection
In order to use XPath properly, we first need to import the corresponding module correctly; here we usually use lxml. An example:
from lxml import etree
import requests
import html
if __name__ == "__main__":
    response = requests.get("https://www.baidu.com")
    encoding = response.apparent_encoding
    response.encoding = encoding
    print(response.text)

    bd_bj = etree.HTML(response.text)
    bd_html = etree.tostring(bd_bj).decode("utf-8")
    print(html.unescape(bd_html))
Lines 1 to 9 are the same as in the Beautiful Soup example; the remaining code is explained below:
- `etree.HTML(response.text)`: uses the `HTML` class in the `etree` module to initialize the Baidu HTML and construct an XPath parsing object of type `<Element html at 0x1aba86b1c08>`
- `etree.tostring(bd_bj).decode("utf-8")`: converts the object above into a string and decodes it as utf-8
- `html.unescape(bd_html)`: converts the HTML entities in `bd_html` into the corresponding Unicode characters, following the rules defined by the HTML5 standard
The printed result is the same as with Beautiful Soup, so it is not shown again here; readers who skipped it can scroll back. Now that we have an XPath-parsable object (bd_bj), we need to select nodes on that object. As mentioned above, XPath mainly extracts nodes through `/`. Here are some common node selection operations in XPath:
all_bj = bd_bj.xpath("//*")           # select all nodes
img_bj = bd_bj.xpath("//img")         # select nodes with the specified name
p_a_zj_bj = bd_bj.xpath("//p/a")      # select direct child nodes
p_a_all_bj = bd_bj.xpath("//p//a")    # select all descendant nodes
head_bj = bd_bj.xpath("//title/..")   # select the parent node
The results are as follows:
[<Element html at 0x14d6a6d1c88>, <Element head at 0x14d6a6e4408>, <Element meta at 0x14d6a6e4448>, <Element meta at 0x14d6a6e4488>, <Element meta at 0x14d6a6e44c8>, <Element link at 0x14d6a6e4548>, <Element title at 0x14d6a6e4588>, <Element body at 0x14d6a6e45c8>, <Element div at 0x14d6a6e4608>, <Element div at 0x14d6a6e4508>, <Element div at 0x14d6a6e4648>, <Element div at 0x14d6a6e4688>, ......] [<Element img at 0x14d6a6e4748>, <Element img at 0x14d6a6e4ec8>] [<Element a at 0x14d6a6e4d88>, <Element a at 0x14d6a6e4dc8>, <Element a at 0x14d6a6e4e48>, <Element a at 0x14d6a6e4e88>] [<Element a at 0x14d6a6e4d88>, <Element a at 0x14d6a6e4dc8>, <Element a at 0x14d6a6e4e48>, <Element a at 0x14d6a6e4e88>] [<Element head at 0x14d6a6e4408>]
- `all_bj = bd_bj.xpath("//*")`: `//*` selects all descendant nodes under the current node (`html`) and returns them as a list; each list element is an `Element` object just like `bd_bj`, and the same return type applies to the operations below
- `img_bj = bd_bj.xpath("//img")`: selects the nodes with the specified name under the current node; comparing this with Beautiful Soup's `.find_all("img")` may help memorization
- `p_a_zj_bj = bd_bj.xpath("//p/a")`: selects every `a` node that is a **direct** child of a `p` node under the current node; note the word "direct": if an `a` node is not a direct child of a `p` node, it is not selected
- `p_a_all_bj = bd_bj.xpath("//p//a")`: selects **all** descendant `a` nodes of every `p` node under the current node; note how "all" distinguishes this from the previous operation
- `head_bj = bd_bj.xpath("//title/..")`: selects the parent of the `title` node under the current node, i.e. the `head` node
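If you want to inspect the markup behind one of these Element objects, you can serialize it the same way the parse object itself was serialized above. A minimal sketch, assuming the `bd_bj` object built earlier:

first_img = bd_bj.xpath("//img")[0]               # take the first Element from the returned list
print(etree.tostring(first_img).decode("utf-8"))  # e.g. <img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" ...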
Data extraction
Now that we know how to select the specified node, we need to extract the data contained in the node, as shown in the following example:
img_href_ls = bd_bj.xpath("//img/@src")
img_href = bd_bj.xpath("//div[@id='lg']/img[@hidefocus='true']/@src")
a_content_ls = bd_bj.xpath("//a//text()")
a_news_content = bd_bj.xpath("//a[@class='mnav' and @name='tj_trnews']/text()")
Output result:
['//www.baidu.com/img/bd_logo1.png', '//www.baidu.com/img/gs.gif']
['//www.baidu.com/img/bd_logo1.png']
['News', 'hao123', 'Map', 'Video', 'Tieba', 'Login', 'More products', 'About Baidu', 'About Baidu', 'Must read before using Baidu', 'Feedback']
['News']
- `img_href_ls = bd_bj.xpath("//img/@src")`: first selects all `img` nodes under the current node, then selects the `src` attribute value of every one of them, returning a list
- `img_href = bd_bj.xpath("//div[@id='lg']/img[@hidefocus='true']/@src")`: first selects the `div` node under the current node whose `id` attribute is `lg`, then selects its direct child `img` node whose `hidefocus` attribute is `true`, and finally selects that node's `src` attribute value
- `a_content_ls = bd_bj.xpath("//a//text()")`: selects the text content found under all `a` nodes of the current node
- `a_news_content = bd_bj.xpath("//a[@class='mnav' and @name='tj_trnews']/text()")`: multi-attribute selection; in XPath, nodes that must satisfy several attributes at once are specified with `and` (a further example using one of XPath's built-in functions follows this list)
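Earlier we noted that XPath also has a number of built-in functions for handling matching conditions. As one simple example (again assuming the `bd_bj` object built above), `contains()` matches an attribute by substring rather than by exact value:

# select the text of every a node whose class attribute contains "mnav"
mnav_texts = bd_bj.xpath("//a[contains(@class, 'mnav')]/text()")
print(mnav_texts)   # e.g. ['News', 'hao123', 'Map', 'Video', 'Tieba']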
Tips: readers should take care to match the code to the output as they read; the meanings are easy to remember once understood and practiced a little.
XPath summary
After reading through the usage of XPath, the attentive reader will find that Beautiful Soup and XPath are essentially the same in idea. As long as you keep the matching rules above in mind, you have already grasped the basic usage of XPath.
Get started with PyQuery
The official explanation for PyQuery is as follows:
pyquery allows you to make jquery queries on xml documents. The API is as much as possible the similar to jquery. pyquery uses lxml for fast xml and html manipulation.
This is not (or at least not yet) a library to produce or interact with javascript code. I just liked the jquery API and I missed it in Python so I told myself “Hey let’s make jquery in Python” This is the result. It can be used for many purposes, one idea that I might try in the future is to use it for templating with pure http templates that you modify using pyquery. I can also be used for web scrapping or for theming applications with Deliverance. The project is being actively developped on a git repository on Github. I have the policy of giving push access to anyone who wants it and then to review what he does. So if you want to contribute just email me. Please report bugs on the github issue tracker.
In addition to Beautiful Soup and XPath, pyquery is another option for parsing web pages, and it too is popular with many crawler developers. Let's take a look at how pyquery is used.
Node selection
Compared with Beautiful Soup and XPath, pyquery can build its parse object from an HTML string, from a URL, or from the path of a local HTML file. Readers can choose whichever fits their habits in actual use. Let's look at these three ways:
import requests
from pyquery import PyQuery as pq

bd_html = requests.get("https://www.baidu.com").text
bd_url = "https://www.baidu.com"
bd_path = "./bd.html"

def way1(html):
    return pq(html)

def way2(url):
    return pq(url=url)

def way3(path):
    return pq(filename=path)

print(type(way1(html=bd_html)))
print(type(way2(url=bd_url)))
print(type(way3(path=bd_path)))
# <class 'pyquery.pyquery.PyQuery'>
# <class 'pyquery.pyquery.PyQuery'>
# <class 'pyquery.pyquery.PyQuery'>
In pyquery, CSS selectors are used on this object to extract the information we want from the page. We continue to use the Baidu home page to explain the use of pyquery, and we assume below that the parse object is bd_bj.
response = requests.get("https://www.baidu.com")
response.encoding = "utf-8"
bd_bj = pq(response.text)

bd_title = bd_bj("title")          # select by node name
bd_img_ls = bd_bj("img")
bd_img_ls2 = bd_bj.find("img")
bd_mnav = bd_bj(".mnav")           # select by class attribute
bd_img = bd_bj("#u1 a")            # select by id attribute
bd_a_video = bd_bj("#u1 .mnav")    # mixed selection

# <title>Google it and you'll see</title>
# <img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270" height="129"/> <img src="//www.baidu.com/img/gs.gif"/>
# ... (the remaining output is longer; readers can run it themselves)
As the code above shows, pyquery usually extracts nodes in one of three ways. The first is to pass the node name directly (the find method does the same thing); since this places no restriction on attributes, it often matches several nodes, which we can then iterate over with .items(), as in the sketch below. The second is to select nodes carrying a particular class attribute, written as `.` followed by the class value. The third is to select nodes carrying a particular id attribute, written as `#` followed by the id value. Readers familiar with CSS will find these easy to understand: they are exactly the selectors CSS uses to locate nodes before styling them. The three forms can also be mixed, as in the extraction of bd_a_video above.
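As a small sketch of the .items() loop just mentioned (assuming the bd_bj object built in the previous code block), this walks over every node carrying the mnav class and prints its text:

for a in bd_bj(".mnav").items():   # .items() yields each matched node as a PyQuery object
    print(a.text())                # e.g. News, hao123, Map, Video, Tieba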
Data extraction
In actual web page parsing, the three tools follow essentially the same approach. For readers who want to understand pyquery's data extraction, and for the blogger's own future reference, here is a brief introduction:
img_src1 = bd_bj("img").attr("src")   # //www.baidu.com/img/bd_logo1.png
img_src2 = bd_bj("img").attr.src      # //www.baidu.com/img/bd_logo1.png

for each in bd_bj.find("img").items():
    print(each.attr("src"))

print(bd_bj("title").text())          # Google it and you'll see
Lines 1 and 2 show the two ways of extracting a node attribute: .attr("src") and .attr.src. As noted in the node selection section, selecting by node name without restricting attributes matches every qualifying node, so these two lines return only the attribute of the first matched node. To extract the attribute of every matching node, combine .items() with a loop as in lines 4 and 5, then read the attribute inside the loop just as before. As line 7 shows, pyquery extracts a node's text with .text().
Pyquery summary
pyquery parsing follows the same idea as Beautiful Soup and XPath, so only a brief introduction is given here; readers who want to learn more can consult the official documentation and practice until proficient.
Tencent recruitment site parsing in practice
Now that we have covered Beautiful Soup, XPath, and pyquery, let's reinforce the three parsing methods with a simple case study. The website analyzed here is the Tencent recruitment site, at hr.tencent.com/; the home page of its social recruitment section looks as follows:
Our task this time is to use each of the three parsing tools above to collect all of the job data under the social recruitment section of the site.
Web page Analysis:
From the social recruitment home page of this site, we can identify three main pieces of information:
- The home page URL: hr.tencent.com/position.ph…
- There are 288 pages of data, 10 jobs per page, for a total of 2,871 jobs
- There are five data fields, namely: job name, job category, number of recruits, work location, and job Posting time
Since we want to parse all the job data on the site, and staying on the first page reveals nothing else of value, let's move on to the second page. We can then see an obvious change in the site's URL: a `start` parameter is now submitted by the client, and the link becomes hr.tencent.com/position.ph… After opening the following pages, the rule is not hard to spot: the `start` parameter submitted by each page increases by a step of 10. Inspecting the page with the Google developer tools then shows that the whole site is made of static pages, which saves us a lot of parsing trouble; the data we need sits statically inside a `table` tag, as follows:
Below, we use the three tools above to parse all the job data on this site.
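Before the full program, here is a minimal sketch of the pagination rule just described: the start parameter grows by 10 per page (287 pages are iterated here, matching the loop in the source code below):

base_url = "https://hr.tencent.com/position.php?&start={}#a"
page_urls = [base_url.format(index * 10) for index in range(287)]
print(page_urls[0])   # https://hr.tencent.com/position.php?&start=0#a
print(page_urls[1])   # https://hr.tencent.com/position.php?&start=10#a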
Example source code
import requests
from bs4 import BeautifulSoup
from lxml import etree
from pyquery import PyQuery as pq
from requests.exceptions import RequestException
import itertools
import pandas as pd


class TencentPosition():

    """
    Function: define the initial variables
    Parameter: start: starting offset of the page data
    """
    def __init__(self, start):
        self.url = "https://hr.tencent.com/position.php?&start={}#a".format(start)
        self.headers = {
            "Host": "hr.tencent.com",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
        }
        self.file_path = "./TencentPosition.csv"

    """
    Function: request the target page
    Parameters: url: target link; headers: request headers
    Return: html, the page source code
    """
    def get_page(self, url, headers):
        try:
            res = requests.get(url, headers=headers)
            if res.status_code == 200:
                return res.text
            else:
                return self.get_page(url, headers=headers)
        except RequestException:
            return self.get_page(url, headers=headers)

    """
    Function: parse the page with Beautiful Soup
    Parameter: html: the requested page source code
    """
    def soup_analysis(self, html):
        soup = BeautifulSoup(html, "lxml")
        tr_list = soup.find("table", class_="tablelist").find_all("tr")
        for tr in tr_list[1:-1]:
            position_info = [td_data for td_data in tr.stripped_strings]
            self.settle_data(position_info=position_info)

    """
    Function: parse the page with XPath
    Parameter: html: the requested page source code
    """
    def xpath_analysis(self, html):
        result = etree.HTML(html)
        tr_list = result.xpath("//table[@class='tablelist']//tr")
        for tr in tr_list[1:-1]:
            position_info = tr.xpath("./td//text()")
            self.settle_data(position_info=position_info)

    """
    Function: parse the page with pyquery
    Parameter: html: the requested page source code
    """
    def pyquery_analysis(self, html):
        result = pq(html)
        tr_list = result.find(".tablelist").find("tr")
        for tr in itertools.islice(tr_list.items(), 1, 11):
            position_info = [td.text() for td in tr.find("td").items()]
            self.settle_data(position_info=position_info)

    """
    Function: assemble the position data
    Parameter: position_info: list of field values
    """
    def settle_data(self, position_info):
        position_data = {
            "Job Title": position_info[0].replace("\xa0", ""),  # replace removes \xa0 characters to prevent transcoding errors
            "Job Category": position_info[1],
            "Number of recruits": position_info[2],
            "Place of Work": position_info[3],
            "Release time": position_info[-1],
        }
        print(position_data)
        self.save_data(self.file_path, position_data)

    """
    Function: save the data
    Parameters: file_path: file save path; position_data: the position data
    """
    def save_data(self, file_path, position_data):
        df = pd.DataFrame([position_data])
        try:
            df.to_csv(file_path, header=False, index=False, mode="a+", encoding="gbk")  # transcode and append the data line by line
        except:
            pass


if __name__ == "__main__":
    for page, index in enumerate(range(287)):
        print("Crawling job data on page {}:".format(page + 1))
        tp = TencentPosition(start=(index * 10))
        tp_html = tp.get_page(url=tp.url, headers=tp.headers)
        tp.pyquery_analysis(html=tp_html)
        print("\n")
Here are some of the results:
Conclusion
In this article we first introduced the common operations of Beautiful Soup, XPath, and pyquery, and then used the three parsing tools to crawl all the job data on the Tencent recruitment site so that readers could get a deeper feel for each of them. Because the focus of this article is web page parsing, multithreading and multiprocessing were not used, and crawling all the data takes a minute or two; a later article will introduce multithreading and multiprocessing together with another case. The parsing methods themselves have been covered above, so readers are encouraged to read the source code; if you have any questions about crawler page parsing, please contact Taoye or leave a comment below.
Note: this article covers the operations commonly used in actual development, not all of them; to improve further, readers should be sure to read the official documentation.