“This is day 21 of my participation in the November Writing Challenge. For event details, see: The Last Writing Challenge of 2021”.
Preface
This post uses Python to scrape and visualize job data from BOSS Zhipin and Lagou. Without further ado,
let's get started.
Development tools
Python version: 3.6.4
Related modules:
requests module;
pyspider module;
pymysql module;
pytesseract module;
random module;
re module;
and some modules that come with Python.
Environment setup
Install Python and add it to the PATH environment variable, then use pip to install the required modules.
This time, I analyze data analysis job postings from BOSS Zhipin and Lagou to understand the state of the data analysis job market.
Web page analysis
From the BOSS Zhipin index pages I collect the job title, salary, location, years of experience, education requirements, and the company's name, type, status, and size.
At first I also wanted to crawl the detail pages, which additionally provide the job description and the required skills.
I gave that up because it takes too many requests: the index is 10 pages long with 30 postings per page, and each detail page needs its own request, which adds up to 300 requests.
By page 2 (about 60 requests), the site was already warning about too many visits.
The index pages alone need only 10 requests, which is basically no problem, and I did not want to bother with proxy IPs, so I kept things simple.
When I get to the data mining postings, I will see whether slowing the requests down works.
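The request arithmetic above, and the idea of slowing down, can be sketched roughly as follows. This is a minimal sketch and not the code used for this post: `crawl_politely`, `fetch`, and the delay bounds are hypothetical, and `fetch` stands in for whatever function actually issues the HTTP request.

```python
import random
import time

PAGES = 10           # index pages, per the text above
POSTS_PER_PAGE = 30  # postings listed on each index page

# Index-only crawl: one request per page.
index_requests = PAGES
# Crawling every detail page needs one request per posting.
detail_requests = PAGES * POSTS_PER_PAGE


def crawl_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    fetch and the delay bounds are assumptions for illustration; sleep is
    injectable so the pacing can be checked without actually waiting.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Randomized pause so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

Whether a slower pace actually avoids the "too many visits" warning depends on the site's limits, which are not known here, so the delay bounds would need to be tuned by trial.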
From the Lagou index pages I collect the job title, location, salary, years of experience, education requirements, company name, type, status, and size, plus job skills and job benefits.
The page loads its data through an Ajax request, and I wrote the code in PyCharm.
Data acquisition
Getting BOSS Zhipin data with pyspider
pyspider is easy to install with pip3 from the command line.
pyspider works with PhantomJS (which handles JavaScript-rendered pages), and I had not installed PhantomJS before.
So download its EXE file from the official website and put it in the same folder as Python's EXE file.
Finally, type pyspider all on the command line to start pyspider.
Open http://localhost:5000/ in the browser, create a project, and fill in the project name and the request URL.
Then I wrote the code in the pyspider script editor and corrected it using the feedback shown in the left pane.
Part of the script editor code is as follows
from pyspider.libs.base_handler import *
import pymysql
import random
import time
import re

count = 0


class Handler(BaseHandler):
    # Add a request header; otherwise the site returns a 403 error
    crawl_config = {'headers': {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}}

    def __init__(self):
        # Connect to the database
        self.db = pymysql.connect(host='127.0.0.1', user='root', password='774110919', port=3306, db='boss_job', charset='utf8mb4')

    def add_Mysql(self, id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people):
        # Write one row to the database
        try:
            cursor = self.db.cursor()
            # Pass the values separately so the driver escapes them,
            # instead of formatting them into the SQL string
            sql = 'insert into job(id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people) values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
            params = (id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people)
            print(sql, params)
            cursor.execute(sql, params)
            print(cursor.lastrowid)
            self.db.commit()
        except Exception as e:
            print(e)
            self.db.rollback()

    @every(minutes=24 * 60)
    def on_start(self):
        ...  # the rest of the script is not shown in this excerpt
The data obtained for BOSS Zhipin data analysis positions
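The write-with-rollback pattern used when saving each posting can be tried without a MySQL server. The sketch below is an illustration only: it swaps in Python's built-in sqlite3 for pymysql, uses a trimmed-down `job` table, and the helper `add_job` and the sample rows are made up. Note that sqlite3 uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3


def add_job(db, job_id, job_title, job_salary, job_city):
    """Insert one posting; roll back if the insert fails.

    Returns True on success and False on failure.
    """
    sql = "INSERT INTO job (id, job_title, job_salary, job_city) VALUES (?, ?, ?, ?)"
    try:
        # Values are passed separately, so quotes inside a title cannot break the SQL.
        db.execute(sql, (job_id, job_title, job_salary, job_city))
        db.commit()
        return True
    except sqlite3.Error as exc:
        print(exc)
        db.rollback()
        return False


# In-memory database and made-up rows, so the sketch runs without a server.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, job_title TEXT, job_salary TEXT, job_city TEXT)")
first = add_job(db, 1, 'Data "Analyst"', "15k-25k", "Beijing")
duplicate = add_job(db, 1, "Data Analyst", "10k-15k", "Shanghai")  # same id, rejected
rows = db.execute("SELECT id, job_title FROM job").fetchall()
```

The second call fails on the duplicate primary key, is rolled back, and leaves only the first row in the table.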
Getting Lagou data with PyCharm
Part of the code
import requests
import pymysql
import random
import time
import json

count = 0

# Set the request URL and request header parameters
url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Cookie': 'Your Cookie value',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Connection': 'keep-alive',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?labelWords=sug&fromSearch=true&suginput=shuju'
}

if __name__ == '__main__':
    # get_message() is defined in the part of the script not shown here
    get_message()
The data obtained for Lagou data analysis positions
Data visualization
City map
Urban distribution heat map
Experience salary chart
Here, the quartiles and medians of the box plot show roughly that salary rises with years of experience.
On BOSS Zhipin, however, the group with less than 1 year of experience has a maximum salary above 40,000, which is clearly unreasonable.
Checking the database showed that the posting actually requires more than 3 years of experience but is labeled as less than 1 year.
So the accuracy of the data provided by the source matters a great deal.
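The numbers behind such a box plot can be sketched as below. This is an illustration under stated assumptions, not the code that produced the chart: it assumes salaries are stored as strings like "15k-25k", the helpers `salary_midpoint` and `quartiles` are hypothetical, and the sample rows are made up, including one mislabeled high-salary posting of the kind described above.

```python
import re
from statistics import median


def salary_midpoint(text):
    """Return the midpoint (in thousands) of a range such as '15k-25k', else None."""
    match = re.fullmatch(r"(\d+)[kK]-(\d+)[kK]", text.strip())
    if match is None:
        return None
    low, high = int(match.group(1)), int(match.group(2))
    return (low + high) / 2


def quartiles(values):
    """Return (Q1, median, Q3), with Q1 and Q3 as medians of the lower and upper halves.

    Expects at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half + (len(ordered) % 2):]
    return median(lower), median(ordered), median(upper)


# Made-up sample rows: (experience label, salary string).
rows = [
    ("under 1 year", "4k-6k"),
    ("under 1 year", "6k-8k"),
    ("under 1 year", "8k-10k"),
    ("under 1 year", "40k-50k"),  # suspicious: likely a mislabeled posting
    ("3-5 years", "15k-25k"),
    ("3-5 years", "20k-30k"),
    ("3-5 years", "25k-35k"),
    ("3-5 years", "30k-40k"),
]

by_experience = {}
for experience, salary in rows:
    value = salary_midpoint(salary)
    if value is not None:
        by_experience.setdefault(experience, []).append(value)

summary = {label: quartiles(values) for label, values in by_experience.items()}
```

In this sample the mislabeled posting barely moves the median of the "under 1 year" group but pulls its upper quartile far up, which is why an outlier like this stands out in a box plot and is worth checking against the raw records.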
Education salary chart
Overall, “master's” > “bachelor's” > “junior college”, although some junior college and bachelor's positions also pay high salaries.
After all, ability matters more and more later in a career, while a degree remains an important plus.
Company status salary chart
Company size salary chart
Normally, the bigger the company, the higher the salary.
After all, the salaries at the big tech companies are common knowledge; it is hard not to know them.
Top 10 company types
Data analysis positions are mainly concentrated in the Internet industry, and “finance”, “real estate”, “education”, “medical” and “games” are also involved.
Job Skill chart
Job benefits word cloud
Here we can see that most of the emphasis is on “five social insurance and one housing fund”, “many benefits”, “good team atmosphere”, “plenty of room for promotion”, and “led by industry experts”.
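The counts that drive a word cloud like this one can be sketched as follows. This is a minimal sketch with assumptions: it supposes each posting's benefits are stored as one comma-separated string, the helper `benefit_frequencies` is hypothetical, and the sample postings are made up.

```python
from collections import Counter


def benefit_frequencies(postings, separator=","):
    """Count how often each benefit tag appears across postings."""
    counts = Counter()
    for text in postings:
        tags = [tag.strip() for tag in text.split(separator)]
        # Skip empty tags left by trailing or doubled separators.
        counts.update(tag for tag in tags if tag)
    return counts


# Made-up sample data; each string holds one posting's benefit tags.
postings = [
    "five social insurance and one housing fund, good team atmosphere",
    "five social insurance and one housing fund, large promotion space",
    "good team atmosphere, five social insurance and one housing fund",
]
frequencies = benefit_frequencies(postings)
top_two = frequencies.most_common(2)
```

A tag-to-count mapping like `frequencies` is the usual input for a word cloud renderer; the same `most_common(10)` call would also give a top 10 ranking such as the one for company types.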