preface

Crawl Zhaopin's nationwide recruitment data using Python Requests + Selenium. If you have read my previous article, you know that we already wrote a crawler that uses Selenium exclusively to crawl Zhaopin.

This time the idea is to open the list pages to collect the links to the recruitment detail pages, and then crawl the data through those links.

1. List page URL acquisition

Below is the URL of a list page: jl is the province code, kw is the search keyword, and p is the page number.

https://sou.zhaopin.com/?jl=532&kw=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&p=1
https://sou.zhaopin.com/?jl=<province>&kw=<keyword>&p=<page>
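For reference, the kw value in the first URL is just the URL-encoded form of the Chinese keyword. A minimal sketch of assembling such a URL (the jl code 532 and the keyword 数据分析 are taken from the example above):

from urllib.parse import quote

jl = "532"            # province/city code from the example URL
keyword = "数据分析"    # "data analysis"
page = 1
url = "https://sou.zhaopin.com/?jl=%s&kw=%s&p=%d" % (jl, quote(keyword), page)
print(url)  # https://sou.zhaopin.com/?jl=532&kw=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&p=1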

One important point: to crawl the whole site we have to loop over the pages, but the total number of pages differs from province to province. So we first need to get the total page count for each province, and only then can we build the list of list-page URLs from which to extract the detail page links.

However, the total number of pages is hidden on the page. The method I came up with is to open the page with Selenium and type a value far larger than the real page count into the page number box; the site then jumps to the last page, which reveals the total. One thing to note: Zhaopin asks you to scan a QR code to log in on each fresh visit, so we save the cookies on the first visit and reuse them for the following visits.

def Get_Page(self):
    page_list = []
    listCookies = []
    print("Start getting the total number of pages for each province")
    for jl in self.jl_list:
        url = "https://sou.zhaopin.com/?jl=%s&kw=%s" % (jl, self.keyword)
        if len(listCookies) > 0:
            # Reuse the cookies saved on the first visit so we are not asked to scan the QR code again
            for cookie in listCookies[0]:
                self.bro.add_cookie(cookie)
        self.bro.get(url)
        sleep(9)
        listCookies.append(self.bro.get_cookies())
        try:
            # Scroll down one screen at a time
            self.bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            # Enter a value much larger than the real page count into the page box
            self.bro.find_element_by_xpath('//div[@class="soupager__pagebox"]/input[@type="text"]').send_keys(1000)
            button = self.bro.find_element_by_xpath('//div[@class="soupager__pagebox"]/button[@class="soupager__btn soupager__pagebox__gobtn"]')
            self.bro.execute_script("arguments[0].click();", button)
            self.bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            # After the jump, the last page number shown is the total number of pages
            page = self.bro.find_element_by_xpath('//div[@class="soupager"]/span[last()]').text
        except Exception as e:
            print(e)
            page = 1
        page_list.append(page)
    self.bro.quit()
    return page_list

Since we know how the list URLs are composed and now have the maximum page number for each province, we can generate the list URLs in batches. Using "data analysis" as the keyword, I ended up with more than 300 list URLs.

def Get_list_Url(self):
    list_url = []
    print("Start generating list page URLs")
    for a, b in zip(self.jl_list, self.Get_Page()):
        for i in range(int(b)):
            # p starts from 1, so use i + 1 as the page number
            url = "https://sou.zhaopin.com/?jl=%s&kw=%s&p=%d" % (a, self.keyword, i + 1)
            with open("list_url.txt", "a", encoding="utf-8") as f:
                f.write(url)
                f.write("\n")
            list_url.append(url)
    print("List page URLs generated")
    return list_url

Now that we have the list URLs, we can visit them to get the detail page URL of each job posting.

2. Details page URL acquisition

We visit every list URL to obtain the detail URL of each job posting. Right-clicking and inspecting the element shows that each detail URL sits in the a tag of its job list item.

We can write code for that

def Parser_Url(self, url):
    list_header = {
        'user-agent': '',
        'cookie': ''
    }
    try:
        text = requests.get(url=url, headers=list_header).text
        html = etree.HTML(text)
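        # Each job item keeps its detail-page link in the first a tag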
        urls = html.xpath('//div[@class="joblist-box__item clearfix"]/a[1]/@href')
        for url in urls:
            with open("detail_url.txt","a", encoding="utf-8") as f:
                f.write(url)
                f.write("\n")
    except Exception as e:
        print(e)
        pass

Now we read the saved list URLs and extract the detail URLs from them.

def Get_detail_Url(self):
    print("Start getting detail page URLs")
    with open("list_url.txt", "r", encoding="utf-8") as f:
        list_url = f.read().split("\n")[0:-1]
    for url in list_url:
        self.Parser_Url(url)

3. Data acquisition

I ended up with more than 7,000 detail URLs, and we can now use them to fetch the data.

Start by writing a function that parses the data

def Parser_Data(self,text):
    html = etree.HTML(text)
    dic = {}
    try:
        dic["name"] = html.xpath('//h3[@class="summary-plane__title"]/text()')[0]
    except:
        dic["name"] = ""
    try:
        dic["salary"] = html.xpath('//span[@class="summary-plane__salary"]/text()')[0]
    except:
        dic["salary"] = ""
    try:
        dic["city"] = html.xpath('//ul[@class="summary-plane__info"]/li[1]/a/text()')[0]
    except:
        dic["city"] = ""
    with open(".//zhilian.csv", "a", encoding="utf-8") as f:
        writer = csv.DictWriter(f, dic.keys())
        writer.writerow(dic)
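One detail to be aware of: Parser_Data appends rows with DictWriter but never writes a header row. If you want column names in zhilian.csv, a minimal sketch (assuming the same three fields) is to write the header once before crawling:

import csv

# Hypothetical one-time setup: write the header row before any data rows are appended
fields = ["name", "salary", "city"]
with open(".//zhilian.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()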

Finally, we loop over the detail page URLs and call the parsing function on each page.

def Get_Data(self):
    print("Start retrieving data")
    detail_header = {
        'user-agent': '',
        'cookie': ''
    }
    with open("detail_url.txt", "r", encoding="utf-8") as f:
        detail_url = f.read().split("\n")[0:-1]
    for url in detail_url:
        # Use selenium
        self.bro.get(url)
        self.Parser_Data(self.bro.page_source)
        # Or use requests instead (optionally through a proxy)
        # text = requests.get(url, headers=detail_header, proxies={"http": "http://213.52.38.102:8080"}).text
        # self.Parser_Data(text)
        sleep(0.2)

If you access the detail pages with Requests instead of Selenium, you need to fill in the request headers, in particular the user-agent and the logged-in cookie.
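A minimal sketch of what that header dict might look like (both values are placeholders you would copy from your own browser's developer tools, not real credentials):

detail_header = {
    # paste your browser's User-Agent string here
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    # paste the cookie string captured after scanning the QR code to log in
    'cookie': '<your zhaopin cookie here>'
}
text = requests.get(url, headers=detail_header).text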

conclusion

Combining Requests with Selenium speeds up data retrieval and significantly reduces the total run time compared with the previous Selenium-only approach.
