Preface
Crawl Zhaopin's nationwide recruitment data using Python Requests + Selenium. If you have read my previous article, you know that we already wrote a crawler that uses Selenium exclusively to crawl Zhaopin.
This time the goal is to open the list pages, extract the links to the recruitment detail pages, and then crawl the data through those links.
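Before diving in, here is a minimal sketch of the crawler class that the snippets below assume; the class name, the Chrome driver, and the example province codes are my additions, while jl_list, keyword and bro are the attributes the later methods rely on.

import csv
import requests
from time import sleep
from lxml import etree
from selenium import webdriver

class ZhaopinSpider:
    def __init__(self, jl_list, keyword):
        self.jl_list = jl_list         # list of province codes, e.g. ["532", "530"] (example values)
        self.keyword = keyword         # search keyword, e.g. "数据分析" (data analysis)
        self.bro = webdriver.Chrome()  # Selenium browser used for page counts and detail pages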
1. List page URL acquisition
Below is the URL of the list page, where jl can be replaced directly with a province code, kw is the search keyword, and p is the page number.
https://sou.zhaopin.com/?jl=532&kw=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&p=1
https://sou.zhaopin.com/?jl=<province>&kw=<keyword>&p=<page>
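As a quick check of how the keyword gets percent-encoded, here is a small sketch (the helper name build_list_url is mine, and 532 is just the province code from the sample URL above):

from urllib.parse import quote

def build_list_url(jl, keyword, page):
    # jl: province code, keyword: search term, page: 1-based page number
    return "https://sou.zhaopin.com/?jl=%s&kw=%s&p=%d" % (jl, quote(keyword), page)

print(build_list_url("532", "数据分析", 1))
# https://sou.zhaopin.com/?jl=532&kw=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&p=1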
One important point: to crawl the whole site we need a loop, but the total number of pages differs for each province, so we must first get the total page count for each province and then build a list of list-page URLs from which to extract the detail-page links.
As shown above, the total number of pages is hidden. The workaround is to use Selenium to open the page and type a value larger than the real page count into the page-number box; the site then jumps to the last page, from which we can read the true total. One thing to note: Zhaopin asks you to scan a QR code to log in on each visit, so on the first visit we save the cookies and reuse them for subsequent visits.
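The Get_Page method below only keeps the cookies in memory for one run. If you also want to skip the QR-code scan on later runs, a sketch like this (the file name and helper names are my assumptions) persists them with pickle:

import pickle

def save_cookies(bro, path="zhaopin_cookies.pkl"):
    # Call this after the first (manually confirmed) QR-code login
    with open(path, "wb") as f:
        pickle.dump(bro.get_cookies(), f)

def load_cookies(bro, path="zhaopin_cookies.pkl"):
    # The browser must already be on a zhaopin.com page before adding cookies
    with open(path, "rb") as f:
        for cookie in pickle.load(f):
            bro.add_cookie(cookie)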
def Get_Page(self):
    page_list = []
    listCookies = []
    print("Start getting the total page count for each province")
    for jl in self.jl_list:
        url = "https://sou.zhaopin.com/?jl=%s&kw=%s" % (jl, self.keyword)
        if len(listCookies) > 0:
            # Reuse the cookies saved on the first visit so we do not have to scan the QR code again
            for cookie in listCookies[0]:
                self.bro.add_cookie(cookie)
        else:
            pass
        self.bro.get(url)
        sleep(9)
        listCookies.append(self.bro.get_cookies())
        try:
            self.bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')  # scroll down one screen at a time
            # Enter a value much larger than the real page count in the page-number box
            self.bro.find_element_by_xpath('//div[@class="soupager__pagebox"]/input[@type="text"]').send_keys(1000)
            button = self.bro.find_element_by_xpath('//div[@class="soupager__pagebox"]/button[@class="soupager__btn soupager__pagebox__gobtn"]')
            self.bro.execute_script("arguments[0].click();", button)
            self.bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')  # scroll down again
            # The site jumps to the last page, so the last span in the pager holds the total page count
            page = self.bro.find_element_by_xpath('//div[@class="soupager"]/span[last()]').text
        except Exception as e:
            print(e)
            page = 1
        page_list.append(page)
    self.bro.quit()
    return page_list
Now that we know how the list URLs are composed and have the maximum page count for each province, we can generate the list URLs in batches. Using "data analysis" as the keyword, I ended up with more than 300 list URLs.
def Get_list_Url(self):
    list_url = []
    print("Start generating list URLs")
    for a, b in zip(self.jl_list, self.Get_Page()):
        for i in range(int(b)):
            # Page numbers start at 1
            url = "https://sou.zhaopin.com/?jl=%s&kw=%s&p=%d" % (a, self.keyword, i + 1)
            with open("list_url.txt", "a", encoding="utf-8") as f:
                f.write(url)
                f.write("\n")
            list_url.append(url)
    print("List URLs saved to list_url.txt")
    return list_url
Now that we have the list URLs, we can visit them to get the detail-page URL of each job posting.
2. Details page URL acquisition
We visit all the list URLs to obtain the detail-page URL of each job posting. Inspecting the page (right-click the element) shows that each item in the job list carries its detail URL in an a tag.
We can write the parsing code as follows:
def Parser_Url(self, url):
    list_header = {
        'user-agent': '',
        'cookie': ''
    }
    try:
        text = requests.get(url=url, headers=list_header).text
        html = etree.HTML(text)
        urls = html.xpath('//div[@class="joblist-box__item clearfix"]/a[1]/@href')
        for url in urls:
            with open("detail_url.txt", "a", encoding="utf-8") as f:
                f.write(url)
                f.write("\n")
    except Exception as e:
        print(e)
        pass
Next we read the saved list URLs and extract the detail URLs from each of them:
def Get_detail_Url(self):
    print("Start getting detail URLs")
    with open("list_url.txt", "r", encoding="utf-8") as f:
        list_url = f.read().split("\n")[0:-1]
    for url in list_url:
        self.Parser_Url(url)
3. Data acquisition
I ended up with more than 7,000 detail URLs, and we can use them to fetch the data.
Start by writing a function that parses the data
def Parser_Data(self, text):
    html = etree.HTML(text)
    dic = {}
    try:
        dic["name"] = html.xpath('//h3[@class="summary-plane__title"]/text()')[0]
    except:
        dic["name"] = ""
    try:
        dic["salary"] = html.xpath('//span[@class="summary-plane__salary"]/text()')[0]
    except:
        dic["salary"] = ""
    try:
        dic["city"] = html.xpath('//ul[@class="summary-plane__info"]/li[1]/a/text()')[0]
    except:
        dic["city"] = ""
    with open(".//zhilian.csv", "a", encoding="utf-8") as f:
        writer = csv.DictWriter(f, dic.keys())
        writer.writerow(dic)
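One caveat with the snippet above: csv.DictWriter does not write a header row by itself, so zhilian.csv ends up without column names. A small variant that writes the header only when the file is still empty could look like this (the helper name write_row is mine):

import csv
import os

def write_row(dic, path="zhilian.csv"):
    # Write the header only once, when the file does not exist yet or is empty
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" avoids blank lines between rows on Windows
    with open(path, "a", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, dic.keys())
        if need_header:
            writer.writeheader()
        writer.writerow(dic)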
Finally, we visit each detail-page URL and call the parsing function:
def Get_Data(self):
    print("Start retrieving data")
    detail_header = {
        'user-agent': '',
        'cookie': ''
    }
    with open("detail_url.txt", "r", encoding="utf-8") as f:
        detail_url = f.read().split("\n")[0:-1]
    for url in detail_url:
        # Fetch with Selenium
        self.bro.get(url)
        self.Parser_Data(self.bro.page_source)
        """Use requests instead"""
        # text = requests.get(url, headers=detail_header, proxies={"http": "http://213.52.38.102:8080"}).text
        # self.Parser_Data(text)
        sleep(0.2)
If you fetch the detail pages with requests instead, you must fill in the request header (at least the user-agent and cookie).
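For example, the header dict might be filled in roughly like this; the user-agent string is only a placeholder, and the cookie value has to be copied from your own logged-in browser session:

detail_header = {
    # Any common browser user-agent string works; this one is only an example
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    # Paste the cookie string from your own logged-in session here
    'cookie': 'copied-from-your-browser'
}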
Conclusion
Combining Requests with Selenium speeds up data retrieval and significantly reduces the total run time compared with the previous approach that used Selenium alone.
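For reference, here is a sketch of how the whole pipeline could be driven, assuming the ZhaopinSpider skeleton from the preface; the province codes are examples, and note that Get_Page quits the Selenium driver, so Get_Data needs either a fresh driver or the commented-out requests branch:

if __name__ == "__main__":
    spider = ZhaopinSpider(jl_list=["532", "530", "538"], keyword="数据分析")
    spider.Get_list_Url()    # calls Get_Page internally and writes list_url.txt
    spider.Get_detail_Url()  # reads list_url.txt and writes detail_url.txt
    spider.Get_Data()        # reads detail_url.txt and writes zhilian.csv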