0. Preface
The blogger has been having trouble deciding which position to pursue lately, and would like to know the current employment prospects for different positions.
So why not write a small crawler to scrape job-posting data from Lagou (拉勾网) and present it graphically, so everything is clear at a glance.
The overall idea is to use Selenium to simulate the behavior of the browser, with the following steps:
- Initialization
- Crawl the data, which is split into two parts: parsing the page data and turning pages
- Save the data to a file
- Data visualization
The overall code structure is as follows:
class Lagou:
    # initialization
    def init(self):
        pass
    # crawl web page data
    def parse_page(self):
        pass
    # turn the page
    def turn_page(self):
        pass
    # crawl data, calling parse_page and turn_page
    def crawl(self):
        pass
    # save the data to a file
    def save(self):
        pass
    # data visualization
    def draw(self):
        pass

if __name__ == '__main__':
    obj = Lagou()
    obj.init()
    obj.crawl()
    obj.save()
    obj.draw()
Now, let's walk through the whole crawler process in detail!
1. Initialization
In the initialization part, we need to complete the following four aspects:
- Prepare global variables
- Start the browser
- Open the start URL
- Set the cookie
(1) Prepare global variables
The so-called global variables refer to the variables needed in the whole crawler process. Here we define two global variables:
- data: stores the crawled data
- isEnd: marks whether the crawl has finished
(2) Start the browser
There are roughly two ways to start the browser: normal startup and headless startup
With normal startup, the entire crawl process is visible, which makes errors easy to spot during debugging
from selenium import webdriver
self.browser = webdriver.Chrome()
Headless startup skips page rendering and speeds up crawling, so it is usually used for the actual crawl
from selenium import webdriver
opt = webdriver.chrome.options.Options()
opt.set_headless()
self.browser = webdriver.Chrome(chrome_options = opt)
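Note: in newer Selenium releases set_headless() and the chrome_options keyword were deprecated; a rough equivalent with the newer API (a sketch, assuming Selenium 4) looks like this:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opt = Options()
opt.add_argument('--headless')           # replaces opt.set_headless()
browser = webdriver.Chrome(options=opt)  # 'options' replaces the deprecated 'chrome_options'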
(3) Open the start URL
First, we open the Lagou homepage (URL: www.lagou.com/).
Enter [python] in the search box and search, then note the URL the page jumps to.
Then, try entering [crawler] (爬虫) in the search box instead; the page jumps to a URL of the form:
www.lagou.com/jobs/list_爬虫…
From this it is easy to spot the pattern; generalizing the URL gives the following result (this is also our start URL):
www.lagou.com/jobs/list_{position}?labelWords=&fromSearch=true&suginput=
where the parameter position is whatever we typed into the search box (it must be URL-encoded)
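urllib.parse.quote handles this URL encoding; a minimal sketch of building the start URL (the position value is just an example):
import urllib.parse

position = 'python 爬虫'   # whatever was typed into the search box
start_url = 'https://www.lagou.com/jobs/list_' + urllib.parse.quote(position) + '?labelWords=&fromSearch=true&suginput='
print(start_url)           # non-ASCII characters and spaces are percent-encoded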
(4) Set cookies
Because Lagou limits how many pages an unlogged-in user can browse, after a certain number of pages the site automatically redirects to the login page:
At this point the crawler can no longer work properly (this is where the blogger was stuck for a long time, unable to find the cause).
To solve this problem, we can use cookies to simulate a logged-in state
For convenience, you can copy the cookies manually from the browser and then add them to the browser instance
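One detail worth noting: Selenium's add_cookie() only accepts cookies for the domain the browser is currently on, so the start URL has to be opened before the cookies are added (as the init() code below does). A minimal sketch of turning a cookie string copied from the browser into add_cookie() calls, with hypothetical cookie values:
cookie = 'user_trace_token=abc123; JSESSIONID=XYZ'   # hypothetical values copied from the browser
for item in cookie.split('; '):
    k, v = item.strip().split('=', 1)                # maxsplit=1, in case a value itself contains '='
    browser.add_cookie({'name': k, 'value': v})      # assumes 'browser' is already on lagou.com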
(5) Complete code of the initialization part
# initialization
def init(self):
    # prepare global variables
    self.data = list()
    self.isEnd = False
    # start and initialize the browser
    opt = webdriver.chrome.options.Options()
    opt.set_headless()
    self.browser = webdriver.Chrome(chrome_options = opt)
    self.wait = WebDriverWait(self.browser, 10)
    # open the start URL
    self.position = input('Please enter position:')
    self.browser.get('https://www.lagou.com/jobs/list_' + urllib.parse.quote(self.position) + '?labelWords=&fromSearch=true&suginput=')
    # set the cookies
    cookie = input('Please enter cookie:')
    for item in cookie.split('; '):
        k, v = item.strip().split('=')
        self.browser.add_cookie({'name': k, 'value': v})
2. Crawl data
In this section, we need to do the following two things:
- Crawl web page data
- Turn pages
(1) Crawl web page data
On the start page we can find the job information we need (it can be matched using XPath):
- Links:
//a[@class="position_link"]
- Position:
//a[@class="position_link"]/h3
- City:
//a[@class="position_link"]/span/em
- Monthly salary, Experience and Education:
//div[@class="p_bot"]/div[@class="li_b_l"]
- Company Name:
//div[@class="company_name"]/a
Here, we need to use the try-except-else exception handling mechanism to handle exceptions and ensure the robustness of the program
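To make the field parsing below concrete, here is a minimal sketch of how the salary / experience / education block is taken apart, assuming the element's text looks roughly like '15k-25k 经验3-5年 / 本科' (a hypothetical sample; the real page text may differ):
text = '15k-25k 经验3-5年 / 本科'                 # hypothetical sample of the li_b_l element's text
left, right = text.split('/')
monthly_salary = left.strip().split(' ')[0]       # '15k-25k'
working_experience = left.strip().split(' ')[1]   # '经验3-5年'
educational_background = right.strip()            # '本科'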
(2) Turn pages
We simulate clicking the "Next page" button to turn the page
Here, too, we need to use try-except-else to handle exceptions
(3) Complete code of the data-crawling part
# crawl web page data
def parse_page(self):
    try:
        # link
        link = self.wait.until(EC.presence_of_all_elements_located((By.XPATH, '//a[@class="position_link"]')))
        link = [item.get_attribute('href') for item in link]
        # position
        position = self.wait.until(EC.presence_of_all_elements_located((By.XPATH, '//a[@class="position_link"]/h3')))
        position = [item.text for item in position]
        # city
        city = self.wait.until(EC.presence_of_all_elements_located((By.XPATH, '//a[@class="position_link"]/span/em')))
        city = [item.text for item in city]
        # monthly salary, working experience and educational background
        ms_we_eb = self.wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@class="p_bot"]/div[@class="li_b_l"]')))
        monthly_salary = [item.text.split('/')[0].strip().split(' ')[0] for item in ms_we_eb]
        working_experience = [item.text.split('/')[0].strip().split(' ')[1] for item in ms_we_eb]
        educational_background = [item.text.split('/')[1].strip() for item in ms_we_eb]
        # company name
        company_name = self.wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@class="company_name"]/a')))
        company_name = [item.text for item in company_name]
    except TimeoutException:
        self.isEnd = True
    except StaleElementReferenceException:
        time.sleep(3)
        self.parse_page()
    else:
        temp = list(map(lambda a,b,c,d,e,f,g: {'link':a, 'position':b, 'city':c, 'monthly_salary':d, 'working_experience':e, 'educational_background':f, 'company_name':g}, link, position, city, monthly_salary, working_experience, educational_background, company_name))
        self.data.extend(temp)
# turn the page
def turn_page(self):
    try:
        pager_next = self.wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'pager_next')))
    except TimeoutException:
        self.isEnd = True
    else:
        pager_next.click()
        time.sleep(3)
# crawl data, calling parse_page and turn_page
def crawl(self):
    count = 0
    while not self.isEnd:
        count += 1
        print('Crawling page ' + str(count) + '...')
        self.parse_page()
        self.turn_page()
    print('Crawl finished')
3. Save data
Next, we store the data in a JSON file
# save the data to a file
def save(self):
    with open('lagou.json', 'w', encoding='utf-8') as f:
        for item in self.data:
            json.dump(item, f, ensure_ascii=False)
There are two things to note here:
- When calling the open() function, you need to pass the argument encoding='utf-8'
- When calling the dump() function, you need to pass the argument ensure_ascii=False
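Also note that calling json.dump() once per item writes the objects back to back, so the resulting file is not a single valid JSON document. A small alternative sketch (not the author's original code) writes one object per line in JSON Lines style, which is easier to read back later:
import json

def save(self):
    with open('lagou.json', 'w', encoding='utf-8') as f:
        for item in self.data:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')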
4. Data visualization
Data visualization helps present the relationships in the data more intuitively. From the extracted data, we can draw the following four bar charts:
- Work Experience – Number of positions
- Work Experience – Average monthly salary
- Education – Number of positions
- Education – Average monthly salary
Here we use the matplotlib library. One thing to watch out for is a Chinese font problem, which can be solved with the following statement:
plt.rcParams['font.sans-serif'] = ['SimHei']
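SimHei is a Windows font, so on Linux or macOS it may not be installed and another CJK font has to be substituted; it is also common to disable the unicode minus handling so axis ticks render correctly. A minimal sketch:
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']   # or another installed CJK font
plt.rcParams['axes.unicode_minus'] = False     # render the minus sign with an ASCII hyphen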
# data visualization
def draw(self):
    count_we = {'No limit to experience': 0, 'Experienced Fresh Graduate': 0, 'Less than 1 year of experience': 0, '1-3 years experience': 0, '3-5 years experience': 0, '5-10 years experience': 0}
    total_we = {'No limit to experience': 0, 'Experienced Fresh Graduate': 0, 'Less than 1 year of experience': 0, '1-3 years experience': 0, '3-5 years experience': 0, '5-10 years experience': 0}
    count_eb = {'不限': 0, 'college': 0, 'bachelor': 0, 'master': 0, '博士': 0}
    total_eb = {'不限': 0, 'college': 0, 'bachelor': 0, 'master': 0, '博士': 0}
    for item in self.data:
        count_we[item['working_experience']] += 1
        count_eb[item['educational_background']] += 1
        try:
            li = [float(temp.replace('k', '000')) for temp in item['monthly_salary'].split('-')]
            total_we[item['working_experience']] += sum(li) / len(li)
            total_eb[item['educational_background']] += sum(li) / len(li)
        except:
            count_we[item['working_experience']] -= 1
            count_eb[item['educational_background']] -= 1
    # solve the Chinese font problem
    plt.rcParams['font.sans-serif'] = ['SimHei']
    # work experience - number of positions
    plt.title(self.position)
    plt.xlabel('Work Experience')
    plt.ylabel('Number of posts')
    x = ['No limit to experience', 'Experienced Fresh Graduate', '1-3 years experience', '3-5 years experience', '5-10 years experience']
    y = [count_we[item] for item in x]
    plt.bar(x, y)
    plt.show()
    # work experience - average monthly salary
    plt.title(self.position)
    plt.xlabel('Work Experience')
    plt.ylabel('Average monthly salary')
    x = list()
    y = list()
    for item in ['No limit to experience', 'Experienced Fresh Graduate', '1-3 years experience', '3-5 years experience', '5-10 years experience']:
        if count_we[item] != 0:
            x.append(item)
            y.append(total_we[item] / count_we[item])
    plt.bar(x, y)
    plt.show()
    # education - number of positions
    plt.title(self.position)
    plt.xlabel('degree')
    plt.ylabel('Number of posts')
    x = ['不限', 'college', 'bachelor', 'master', '博士']
    y = [count_eb[item] for item in x]
    plt.bar(x, y)
    plt.show()
    # education - average monthly salary
    plt.title(self.position)
    plt.xlabel('degree')
    plt.ylabel('Average monthly salary')
    x = list()
    y = list()
    for item in ['不限', 'college', 'bachelor', 'master', '博士']:
        if count_eb[item] != 0:
            x.append(item)
            y.append(total_eb[item] / count_eb[item])
    plt.bar(x, y)
    plt.show()
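If the script runs on a machine without a display, plt.show() will not pop up a window; one option (a sketch, with a hypothetical file name) is to save each chart to a file instead:
plt.bar(x, y)
plt.savefig('experience_count.png', bbox_inches='tight')  # hypothetical file name
plt.close()                                               # clear the figure before drawing the next chart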
5. All done
(1) Complete code
So far, the whole crawler process has been analyzed, and the complete code is as follows:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException
import urllib.parse
import time
import json
import matplotlib.pyplot as plt

class Lagou:
    # initialization
    def init(self):
        self.data = list()
        self.isEnd = False
        opt = webdriver.chrome.options.Options()
        opt.set_headless()
        self.browser = webdriver.Chrome(chrome_options = opt)
        self.wait = WebDriverWait(self.browser, 10)
        self.position = input('Please enter position:')
        self.browser.get('https://www.lagou.com/jobs/list_' + urllib.parse.quote(self.position) + '?labelWords=&fromSearch=true&suginput=')
        cookie = input('Please enter cookie:')
        for item in cookie.split('; '):
            k, v = item.strip().split('=')
            self.browser.add_cookie({'name': k, 'value': v})
    # crawl web page data
    def parse_page(self):
        try:
            link = self.wait.until(EC.presence_of_all_elements_located((By.XPATH, '//a[@class="position_link"]')))
            link = [item.get_attribute('href') for item in link]
            position = self.wait.until(EC.presence_of_all_elements_located((By.XPATH, '//a[@class="position_link"]/h3')))
            position = [item.text for item in position]
            city = self.wait.until(EC.presence_of_all_elements_located((By.XPATH, '//a[@class="position_link"]/span/em')))
            city = [item.text for item in city]
            ms_we_eb = self.wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@class="p_bot"]/div[@class="li_b_l"]')))
            monthly_salary = [item.text.split('/')[0].strip().split(' ')[0] for item in ms_we_eb]
            working_experience = [item.text.split('/')[0].strip().split(' ')[1] for item in ms_we_eb]
            educational_background = [item.text.split('/')[1].strip() for item in ms_we_eb]
            company_name = self.wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@class="company_name"]/a')))
            company_name = [item.text for item in company_name]
        except TimeoutException:
            self.isEnd = True
        except StaleElementReferenceException:
            time.sleep(3)
            self.parse_page()
        else:
            temp = list(map(lambda a,b,c,d,e,f,g: {'link':a, 'position':b, 'city':c, 'monthly_salary':d, 'working_experience':e, 'educational_background':f, 'company_name':g}, link, position, city, monthly_salary, working_experience, educational_background, company_name))
            self.data.extend(temp)
    # turn the page
    def turn_page(self):
        try:
            pager_next = self.wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'pager_next')))
        except TimeoutException:
            self.isEnd = True
        else:
            pager_next.click()
            time.sleep(3)
    # crawl data
    def crawl(self):
        count = 0
        while not self.isEnd:
            count += 1
            print('Crawling page ' + str(count) + '...')
            self.parse_page()
            self.turn_page()
        print('Crawl finished')
    # save data
    def save(self):
        with open('lagou.json', 'w', encoding='utf-8') as f:
            for item in self.data:
                json.dump(item, f, ensure_ascii=False)
    # data visualization
    def draw(self):
        count_we = {'No limit to experience': 0, 'Experienced Fresh Graduate': 0, 'Less than 1 year of experience': 0, '1-3 years experience': 0, '3-5 years experience': 0, '5-10 years experience': 0}
        total_we = {'No limit to experience': 0, 'Experienced Fresh Graduate': 0, 'Less than 1 year of experience': 0, '1-3 years experience': 0, '3-5 years experience': 0, '5-10 years experience': 0}
        count_eb = {'不限': 0, 'college': 0, 'bachelor': 0, 'master': 0, '博士': 0}
        total_eb = {'不限': 0, 'college': 0, 'bachelor': 0, 'master': 0, '博士': 0}
        for item in self.data:
            count_we[item['working_experience']] += 1
            count_eb[item['educational_background']] += 1
            try:
                li = [float(temp.replace('k', '000')) for temp in item['monthly_salary'].split('-')]
                total_we[item['working_experience']] += sum(li) / len(li)
                total_eb[item['educational_background']] += sum(li) / len(li)
            except:
                count_we[item['working_experience']] -= 1
                count_eb[item['educational_background']] -= 1
        # solve the Chinese font problem
        plt.rcParams['font.sans-serif'] = ['SimHei']
        # work experience - number of positions
        plt.title(self.position)
        plt.xlabel('Work Experience')
        plt.ylabel('Number of posts')
        x = ['No limit to experience', 'Experienced Fresh Graduate', '1-3 years experience', '3-5 years experience', '5-10 years experience']
        y = [count_we[item] for item in x]
        plt.bar(x, y)
        plt.show()
        # work experience - average monthly salary
        plt.title(self.position)
        plt.xlabel('Work Experience')
        plt.ylabel('Average monthly salary')
        x = list()
        y = list()
        for item in ['No limit to experience', 'Experienced Fresh Graduate', '1-3 years experience', '3-5 years experience', '5-10 years experience']:
            if count_we[item] != 0:
                x.append(item)
                y.append(total_we[item] / count_we[item])
        plt.bar(x, y)
        plt.show()
        # education - number of positions
        plt.title(self.position)
        plt.xlabel('degree')
        plt.ylabel('Number of posts')
        x = ['不限', 'college', 'bachelor', 'master', '博士']
        y = [count_eb[item] for item in x]
        plt.bar(x, y)
        plt.show()
        # education - average monthly salary
        plt.title(self.position)
        plt.xlabel('degree')
        plt.ylabel('Average monthly salary')
        x = list()
        y = list()
        for item in ['不限', 'college', 'bachelor', 'master', '博士']:
            if count_eb[item] != 0:
                x.append(item)
                y.append(total_eb[item] / count_eb[item])
        plt.bar(x, y)
        plt.show()

if __name__ == '__main__':
    obj = Lagou()
    obj.init()
    obj.crawl()
    obj.save()
    obj.draw()
(2) Running the program
Now, let's run the code and see!
When the code runs, the program asks for a [position] and a [cookie]; the cookie can be obtained as follows:
Go to the homepage and log in
Use the shortcut keys Ctrl+Shift+I or F12 to open developer tools
Enter the position ([python] in this example) in the search box and search, then inspect the captured requests; the cookie information can be seen in them
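Based on how init() parses the input, the cookie should be pasted as a single line in the usual 'name=value; name=value' form, for example (hypothetical values):
user_trace_token=abc123; JSESSIONID=XYZ; login=true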
The complete operation process is as follows: