0. Preface

The blogger has been struggling to choose a career direction lately and would like to know the current employment prospects for different positions

So why not write a small crawler to scrape job-posting data from Lagou (拉勾网) and present it graphically, so the answer is clear at a glance

The overall idea is to use Selenium to simulate the behavior of the browser, with the following steps:

  1. Initialization
  2. Crawl the data; this is split into two parts: crawling the page data and turning pages
  3. Save the data to a file
  4. Data visualization

The overall code structure is as follows:

class Lagou:
    # initialization
    def init(self):
        pass

    # crawl web data
    def parse_page(self):
        pass

    # turn the page
    def turn_page(self):
        pass

    # crawl data, calling parse_page and turn_page
    def crawl(self):
        pass

    # save the data to a file
    def save(self):
        pass

    # data visualization
    def draw(self):
        pass

if __name__ == '__main__':
    obj = Lagou()
    obj.init()
    obj.crawl()
    obj.save()
    obj.draw()

Now, let’s walk through the whole crawler process in detail!

1. Initialization

In the initialization step, we need to complete the following four tasks:

  1. Prepare global variables
  2. Start the browser
  3. Open the start URL
  4. Set the cookie

(1) Prepare global variables

The global variables are the variables needed throughout the crawl. Here we define two:

  • data: stores the crawled data
  • isEnd: indicates whether the crawl has finished

(2) Start the browser

There are roughly two ways to start the browser: normal startup and headless startup.

With normal startup, the whole crawl process is visible, which makes it easy to spot errors while debugging:

from selenium import webdriver
self.browser = webdriver.Chrome()

Headless startup skips rendering the browser window, which speeds up crawling; it is usually used for the actual crawl:

from selenium import webdriver
opt = webdriver.chrome.options.Options()
opt.set_headless()
self.browser = webdriver.Chrome(chrome_options = opt)
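
Note: opt.set_headless() and the chrome_options keyword belong to older Selenium releases (3.x); both were removed in Selenium 4. If you are on a newer version, a roughly equivalent headless setup (a sketch, not the blogger's original code) looks like this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opt = Options()
opt.add_argument('--headless')                 # replaces the removed set_headless()
self.browser = webdriver.Chrome(options=opt)   # 'chrome_options' was renamed to 'options'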

(3) Open the start URL

First, we open the Lagou homepage (URL: www.lagou.com/).

If we enter [python] in the search box and search, the page goes to the following URL:

www.lagou.com/jobs/list_p…

Then, if we search for [crawler] instead, the page jumps to the following URL:

www.lagou.com/jobs/list_爬虫…

From this it is easy to spot the pattern; generalizing the URL gives the following result (this is also our start URL):

www.lagou.com/jobs/list_{…

Here the position argument is whatever we typed into the search box (it must be URL-encoded).
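
For example, the start URL for a given keyword can be built as follows (this mirrors what init() does later, using urllib.parse.quote for the URL encoding):

import urllib.parse

position = 'python'
start_url = ('https://www.lagou.com/jobs/list_'
             + urllib.parse.quote(position)
             + '?labelWords=&fromSearch=true&suginput=')
print(start_url)  # https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=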

(4) Set cookies

Because Lagou limits how many pages an unlogged-in user can browse, after a certain number of pages the site automatically redirects to the login page.

At this point the crawler stops working properly (this is where the blogger was stuck for a long time without finding the cause).

To solve this problem, we can use a cookie to simulate being logged in.

For convenience, we can copy the cookie from the browser manually and then add it to the Selenium browser instance.

(5) Complete code for the initialization step

# initialization
def init(self):
    # prepare global variables
    self.data = list()
    self.isEnd = False
    # start and initialize the browser
    opt = webdriver.chrome.options.Options()
    opt.set_headless()
    self.browser = webdriver.Chrome(chrome_options = opt)
    self.wait = WebDriverWait(self.browser, 10)
    # open the starting URL
    self.position = input('Please enter position:')
    self.browser.get('https://www.lagou.com/jobs/list_' + urllib.parse.quote(self.position) + '?labelWords=&fromSearch=true&suginput=')
    # set the cookie
    cookie = input('Please enter cookie:')
    for item in cookie.split('; '):
        k, v = item.strip().split('=', 1)   # split only on the first '=' in case the value contains '='
        self.browser.add_cookie({'name': k, 'value': v})

2. Crawl data

In this section, we need to do the following two things:

  1. Crawl web page data
  2. Turn pages

(1) Crawl web page data

On the start page we can find the job information we need (it can be located with XPath):

  • Links: //a[@class="position_link"]
  • Position: //a[@class="position_link"]/h3
  • City: //a[@class="position_link"]/span/em
  • Monthly salary, Experience and Education: //div[@class="p_bot"]/div[@class="li_b_l"]
  • Company Name: //div[@class="company_name"]/a
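
The "Monthly salary, Experience and Education" item above is a single node whose text combines all three fields; the parsing code below pulls them apart with split(). A small sketch of that logic (the sample string is only an assumed illustration of the page's format):

# assumed sample of the li_b_l element text; the real value comes from the page
text = '15k-25k 经验3-5年 / 本科'

monthly_salary = text.split('/')[0].strip().split(' ')[0]       # '15k-25k'
working_experience = text.split('/')[0].strip().split(' ')[1]   # '经验3-5年'
educational_background = text.split('/')[1].strip()             # '本科'
print(monthly_salary, working_experience, educational_background)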

Here we need to use the try-except-else exception-handling mechanism to handle exceptions and keep the program robust.

(2) Turn pages

We simulate clicking the “Next page” button to turn the page.

Here, too, we need try-except-else to handle exceptions.

(3) Complete code for the data-crawling step

# crawl web data
def parse_page(self):
    try:
        # link
        link = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//a[@class="position_link"]')))
        link = [item.get_attribute('href') for item in link]
        # position
        position = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//a[@class="position_link"]/h3')))
        position = [item.text for item in position]
        # city
        city = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//a[@class="position_link"]/span/em')))
        city = [item.text for item in city]
        # monthly salary, working experience and educational background
        ms_we_eb = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="p_bot"]/div[@class="li_b_l"]')))
        monthly_salary = [item.text.split('/')[0].strip().split(' ')[0] for item in ms_we_eb]
        working_experience = [item.text.split('/')[0].strip().split(' ')[1] for item in ms_we_eb]
        educational_background = [item.text.split('/')[1].strip() for item in ms_we_eb]
        # Company name
        company_name = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="company_name"]/a')))
        company_name = [item.text for item in company_name]
    except TimeoutException:
        self.isEnd = True
    except StaleElementReferenceException:
        time.sleep(3)
        self.parse_page()
    else:
        temp = list(map(lambda a,b,c,d,e,f,g: {'link':a,'position':b,'city':c,'monthly_salary':d,'working_experience':e,'educational_background':f,'company_name':g}, link, position, city, monthly_salary, working_experience, educational_background, company_name))
        self.data.extend(temp)

# Turn the page
def turn_page(self):
    try:
        pager_next = self.wait.until(EC.element_to_be_clickable((By.CLASS_NAME,'pager_next')))
    except TimeoutException:
        self.isEnd = True
    else:
        pager_next.click()
        time.sleep(3)

# crawl data, calling parse_page and turn_page
def crawl(self):
    count = 0
    while not self.isEnd:
        count += 1
        print('Crawling page ' + str(count) + '...')
        self.parse_page()
        self.turn_page()
    print('Crawl finished')

3. Save data

Next, we store the data in a JSON file

# save the data to a file
def save(self):
    with open('lagou.json', 'w', encoding='utf-8') as f:
        for item in self.data:
            json.dump(item,f,ensure_ascii=False)

There are two things to note here:

  • When calling the open() function, you need to pass the argument encoding='utf-8'
  • When calling the dump() function, you need to pass the argument ensure_ascii=False
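
Note also that calling json.dump() once per item writes the objects back-to-back with no separator between them, which makes the file awkward to load afterwards. A minimal alternative sketch (an adjustment, not the blogger's original code) writes one JSON object per line (JSON Lines); json is already imported in the complete code later:

# save the data to a file, one JSON object per line
def save(self):
    with open('lagou.json', 'w', encoding='utf-8') as f:
        for item in self.data:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')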

4. Data visualization

Data visualization makes the relationships in the data much easier to see. From the extracted data, we can draw the following four bar charts:

  • Work Experience – Number of positions
  • Work Experience – Average monthly salary
  • Education – Number of positions
  • Education – Average monthly salary

Here we use the matplotlib library. Note that there is a Chinese character display problem, which can be solved with the following statement:

plt.rcParams['font.sans-serif'] = ['SimHei']
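
(If minus signs ever appear on an axis, it is also common to add plt.rcParams['axes.unicode_minus'] = False alongside the line above, because the SimHei font does not render the default Unicode minus sign.)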

# Data Visualization
def draw(self):
    # counts and salary totals, keyed by the Chinese labels scraped from the page
    count_we = {'经验不限':0, '经验应届毕业生':0, '经验1年以下':0, '经验1-3年':0, '经验3-5年':0, '经验5-10年':0}
    total_we = {'经验不限':0, '经验应届毕业生':0, '经验1年以下':0, '经验1-3年':0, '经验3-5年':0, '经验5-10年':0}
    count_eb = {'不限':0, '大专':0, '本科':0, '硕士':0, '博士':0}
    total_eb = {'不限':0, '大专':0, '本科':0, '硕士':0, '博士':0}
    for item in self.data:
        count_we[item['working_experience']] += 1
        count_eb[item['educational_background']] += 1
        try:
            # '15k-25k' -> average of 15000 and 25000
            li = [float(temp.replace('k', '000')) for temp in item['monthly_salary'].split('-')]
            total_we[item['working_experience']] += sum(li) / len(li)
            total_eb[item['educational_background']] += sum(li) / len(li)
        except:
            # salary could not be parsed, so drop this item from the counts as well
            count_we[item['working_experience']] -= 1
            count_eb[item['educational_background']] -= 1
    # solve the Chinese character display problem
    plt.rcParams['font.sans-serif'] = ['SimHei']
    # work experience - number of positions
    plt.title(self.position)
    plt.xlabel('Work Experience')
    plt.ylabel('Number of positions')
    x = ['经验不限', '经验应届毕业生', '经验1-3年', '经验3-5年', '经验5-10年']
    y = [count_we[item] for item in x]
    plt.bar(x, y)
    plt.show()
    # work experience - average monthly salary
    plt.title(self.position)
    plt.xlabel('Work Experience')
    plt.ylabel('Average monthly salary')
    x = list()
    y = list()
    for item in ['经验不限', '经验应届毕业生', '经验1-3年', '经验3-5年', '经验5-10年']:
        if count_we[item] != 0:
            x.append(item)
            y.append(total_we[item] / count_we[item])
    plt.bar(x, y)
    plt.show()
    # education - number of positions
    plt.title(self.position)
    plt.xlabel('Education')
    plt.ylabel('Number of positions')
    x = ['不限', '大专', '本科', '硕士', '博士']
    y = [count_eb[item] for item in x]
    plt.bar(x, y)
    plt.show()
    # education - average monthly salary
    plt.title(self.position)
    plt.xlabel('Education')
    plt.ylabel('Average monthly salary')
    x = list()
    y = list()
    for item in ['不限', '大专', '本科', '硕士', '博士']:
        if count_eb[item] != 0:
            x.append(item)
            y.append(total_eb[item] / count_eb[item])
    plt.bar(x, y)
    plt.show()

You’re done

(1) Complete code

So far, the whole crawler process has been analyzed, and the complete code is as follows:

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException
import urllib.parse
import time
import json
import matplotlib.pyplot as plt

class Lagou:
    # initialization
    def init(self):
        self.data = list()
        self.isEnd = False
        opt = webdriver.chrome.options.Options()
        opt.set_headless()
        self.browser = webdriver.Chrome(chrome_options = opt)
        self.wait = WebDriverWait(self.browser,10)
        self.position = input('Please enter position:')
        self.browser.get('https://www.lagou.com/jobs/list_' + urllib.parse.quote(self.position) + '?labelWords=&fromSearch=true&suginput=')
        cookie = input('Please enter cookie:')
        for item in cookie.split('; '):
            k, v = item.strip().split('=', 1)
            self.browser.add_cookie({'name':k,'value':v})

    # crawl web data
    def parse_page(self):
        try:
            link = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//a[@class="position_link"]')))
            link = [item.get_attribute('href') for item in link]
            position = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//a[@class="position_link"]/h3')))
            position = [item.text for item in position]
            city = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//a[@class="position_link"]/span/em')))
            city = [item.text for item in city]
            ms_we_eb = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="p_bot"]/div[@class="li_b_l"]')))
            monthly_salary = [item.text.split('/')[0].strip().split(' ')[0] for item in ms_we_eb]
            working_experience = [item.text.split('/')[0].strip().split(' ')[1] for item in ms_we_eb]
            educational_background = [item.text.split('/')[1].strip() for item in ms_we_eb]
            company_name = self.wait.until(EC.presence_of_all_elements_located((By.XPATH,'//div[@class="company_name"]/a')))
            company_name = [item.text for item in company_name]
        except TimeoutException:
            self.isEnd = True
        except StaleElementReferenceException:
            time.sleep(3)
            self.parse_page()
        else:
            temp = list(map(lambda a,b,c,d,e,f,g: {'link':a,'position':b,'city':c,'monthly_salary':d,'working_experience':e,'educational_background':f,'company_name':g}, link, position, city, monthly_salary, working_experience, educational_background, company_name))
            self.data.extend(temp)

    # Turn the page
    def turn_page(self):
        try:
            pager_next = self.wait.until(EC.element_to_be_clickable((By.CLASS_NAME,'pager_next')))
        except TimeoutException:
            self.isEnd = True
        else:
            pager_next.click()
            time.sleep(3)

    # crawl data
    def crawl(self):
        count = 0
        while not self.isEnd:
            count += 1
            print('Crawling page ' + str(count) + '...')
            self.parse_page()
            self.turn_page()
        print('Crawl finished')

    # Save data
    def save(self):
        with open('lagou.json', 'w', encoding='utf-8') as f:
            for item in self.data:
                json.dump(item,f,ensure_ascii=False)
    
    # Data Visualization
    def draw(self):
        # counts and salary totals, keyed by the Chinese labels scraped from the page
        count_we = {'经验不限':0, '经验应届毕业生':0, '经验1年以下':0, '经验1-3年':0, '经验3-5年':0, '经验5-10年':0}
        total_we = {'经验不限':0, '经验应届毕业生':0, '经验1年以下':0, '经验1-3年':0, '经验3-5年':0, '经验5-10年':0}
        count_eb = {'不限':0, '大专':0, '本科':0, '硕士':0, '博士':0}
        total_eb = {'不限':0, '大专':0, '本科':0, '硕士':0, '博士':0}
        for item in self.data:
            count_we[item['working_experience']] += 1
            count_eb[item['educational_background']] += 1
            try:
                # '15k-25k' -> average of 15000 and 25000
                li = [float(temp.replace('k', '000')) for temp in item['monthly_salary'].split('-')]
                total_we[item['working_experience']] += sum(li) / len(li)
                total_eb[item['educational_background']] += sum(li) / len(li)
            except:
                # salary could not be parsed, so drop this item from the counts as well
                count_we[item['working_experience']] -= 1
                count_eb[item['educational_background']] -= 1
        # solve the Chinese character display problem
        plt.rcParams['font.sans-serif'] = ['SimHei']
        # work experience - number of positions
        plt.title(self.position)
        plt.xlabel('Work Experience')
        plt.ylabel('Number of positions')
        x = ['经验不限', '经验应届毕业生', '经验1-3年', '经验3-5年', '经验5-10年']
        y = [count_we[item] for item in x]
        plt.bar(x, y)
        plt.show()
        # work experience - average monthly salary
        plt.title(self.position)
        plt.xlabel('Work Experience')
        plt.ylabel('Average monthly salary')
        x = list()
        y = list()
        for item in ['经验不限', '经验应届毕业生', '经验1-3年', '经验3-5年', '经验5-10年']:
            if count_we[item] != 0:
                x.append(item)
                y.append(total_we[item] / count_we[item])
        plt.bar(x, y)
        plt.show()
        # education - number of positions
        plt.title(self.position)
        plt.xlabel('Education')
        plt.ylabel('Number of positions')
        x = ['不限', '大专', '本科', '硕士', '博士']
        y = [count_eb[item] for item in x]
        plt.bar(x, y)
        plt.show()
        # education - average monthly salary
        plt.title(self.position)
        plt.xlabel('Education')
        plt.ylabel('Average monthly salary')
        x = list()
        y = list()
        for item in ['不限', '大专', '本科', '硕士', '博士']:
            if count_eb[item] != 0:
                x.append(item)
                y.append(total_eb[item] / count_eb[item])
        plt.bar(x, y)
        plt.show()

if __name__ == '__main__':
    obj = Lagou()
    obj.init()
    obj.crawl()
    obj.save()
    obj.draw()

(2) Operation process

Now, let’s run the code and see!

When the code runs, the program asks you to input a [position] and a [cookie]. The cookie can be obtained as follows:

Go to the Lagou homepage and log in

Use the shortcut keys Ctrl+Shift+I or F12 to open developer tools

Enter the position ([python] in this example) in the search box, search, and inspect the captured requests; you can see that the cookie information is included in the request headers
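
As an alternative to copying the cookie by hand, a small helper (a sketch, not part of the blogger's code) can open a visible browser, let you log in manually, and then print the cookie string in exactly the 'k1=v1; k2=v2' form that init() expects:

from selenium import webdriver

def dump_lagou_cookie():
    browser = webdriver.Chrome()   # visible browser so the login can be done by hand
    browser.get('https://www.lagou.com/')
    input('Log in in the opened browser window, then press Enter here...')
    # get_cookies() returns a list of dicts with 'name' and 'value' keys
    cookie = '; '.join(c['name'] + '=' + c['value'] for c in browser.get_cookies())
    browser.quit()
    print(cookie)

if __name__ == '__main__':
    dump_lagou_cookie()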

The complete operation process is as follows:

(3) Running results