“This is day 21 of my participation in the November Writing Challenge (更文挑战). Event details: The Last Writing Challenge of 2021”.

preface

Using Python to achieve visualization of BOOS direct employment & Pull hook net job data. Without further ado.

Let’s have a good time

Development tools

Python version: 3.6.4

Related modules:

requests module;

pyspider module;

pymysql module;

pytesseract module;

random module;

re module;

as well as some other modules that come with Python.

Environment setup

Install Python and add it to the environment variables, then use pip to install the required modules.
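To confirm the environment before running anything, a quick check along these lines can help. It is only a sketch: the module list is copied from the section above, and the helper name missing_modules is my own, not part of any library.

```python
from importlib.util import find_spec

# Third-party modules listed above (import names); random and re ship with Python.
REQUIRED = ["requests", "pyspider", "pymysql", "pytesseract"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Install with pip:", " ".join(missing))
    else:
        print("All required modules are available.")
```

Anything the check reports can then be installed with pip.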

This time, I analyze data analysis job postings from BOSS Zhipin and Lagou to understand the state of the industry for data analysis positions.

Web page analysis

First, obtain the information on the BOSS Zhipin index pages, mainly the job title, salary, location, years of experience, education requirements, and the company's name, type, status and size.

At first I also wanted to crawl the detail pages, which additionally provide the job description and the required skills.

I gave up on that because it takes too many requests. The index has 10 pages with 30 postings each, and every detail page needs its own request, which adds up to 300 requests.

By page 2 (60 requests), the site was already warning about too many visits.

Fetching only the index pages takes just 10 requests, which is basically no problem. I also did not want to fiddle with proxy IPs, so I kept it simple.

Later, when I do data mining, I will see whether slowing down the requests works.
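As a rough idea of what "slowing down" could look like, here is a minimal sketch that pauses a random interval between requests. The function name crawl_slowly and the delay range are my own assumptions, and I have not verified that any particular delay avoids the "too many visits" warning.

```python
import random
import time


def crawl_slowly(urls, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests.

    `fetch` is any callable that takes a URL and returns the page content.
    The default delays are assumed values, not tuned against a real site.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Random pause so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

With requests, fetch could be something like lambda u: requests.get(u, headers=headers).text. The sleep parameter exists only so the pause can be replaced when testing.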

For Lagou, obtain the index page information, mainly the job title, location, salary, years of experience, education requirements, the company's name, type, status and size, plus job skills and job benefits.

The page loads its data through an Ajax request, and the code is written in PyCharm.

Data acquisition

Getting BOSS Zhipin data with pyspider

pyspider is easy to install with pip3 on the command line.

PhantomJS, which pyspider uses to handle JavaScript-rendered pages, was not installed on my machine yet.

So you need to download its EXE file from the official website and put it in the same folder as Python's EXE file.

Finally, type pyspider all on the command line to run pyspider.

Open http://localhost:5000/ in the browser, create a project, enter the project name and the request URL, and you get the following page.

Finally, I wrote the code in the pyspider script editor and corrected it using the feedback shown on the left.

Part of the script editor code is as follows:

from pyspider.libs.base_handler import *
import pymysql
import random
import time
import re

count = 0

class Handler(BaseHandler):
    # Add a request header, otherwise the site returns a 403 error
    crawl_config = {'headers': {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}}

    def __init__(self):
        # Connect to the database
        self.db = pymysql.connect(host='127.0.0.1', user='root', password='774110919', port=3306, db='boss_job', charset='utf8mb4')

    def add_Mysql(self, id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people):
        # Write one row to the database
        try:
            cursor = self.db.cursor()
            # Pass the values as query parameters so quotes in the data cannot break the statement
            sql = 'insert into job(id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people) values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
            print(sql)
            cursor.execute(sql, (id, job_title, job_salary, job_city, job_experience, job_education, company_name, company_type, company_status, company_people))
            print(cursor.lastrowid)
            self.db.commit()
        except Exception as e:
            print(e)
            self.db.rollback()

    @every(minutes=24 * 60)
    def on_start(self):
        ...  # the rest of the script is omitted in the original post

The data analysis job data obtained from BOSS Zhipin
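The database write can be tried out without a MySQL server. The sketch below is a stand-in rather than the script's actual code: it uses Python's built-in sqlite3 instead of pymysql, reuses the column names from the job table above, and passes the values as query parameters. The helper names create_table and add_job are hypothetical. With pymysql the placeholder would be %s instead of ?.

```python
import sqlite3

COLUMNS = ("id", "job_title", "job_salary", "job_city", "job_experience",
           "job_education", "company_name", "company_type",
           "company_status", "company_people")


def create_table(conn):
    """Create a job table with the same column names as the script above."""
    cols = ", ".join(f"{c} TEXT" for c in COLUMNS[1:])
    conn.execute(f"CREATE TABLE IF NOT EXISTS job (id INTEGER PRIMARY KEY, {cols})")


def add_job(conn, row):
    """Insert one job row; values are passed as parameters, not formatted into the SQL."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = f"INSERT INTO job ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    try:
        conn.execute(sql, row)
        conn.commit()
        return True
    except sqlite3.Error as exc:
        # Same pattern as the script: report the error and roll back.
        print(exc)
        conn.rollback()
        return False
```

Because the values never become part of the SQL text, a job title containing quotation marks is stored as-is instead of breaking the statement.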

Getting Lagou data with PyCharm

Part of the code is as follows:

import requests
import pymysql
import random
import time
import json

count = 0
# Set the request URL and request header parameters
url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Cookie': 'Your Cookie value',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Connection': 'keep-alive',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?labelWords=sug&fromSearch=true&suginput=shuju'
}


if __name__ == '__main__':
    # get_message() is defined in the part of the script omitted here
    get_message()

The data analysis job data obtained from Lagou
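The listing above calls get_message() without showing it. As a rough illustration of how paging through an Ajax endpoint that returns JSON might look, here is a sketch. It is not the author's function: the name fetch_pages, the form fields pn and kd, and the layout of the JSON body are all assumptions, which is why the code takes the HTTP call and the record extraction as parameters instead of hard-coding them.

```python
import json


def fetch_pages(post, url, headers, keyword, pages, extract):
    """POST one form per page and collect the records found in each JSON body.

    `post` is a callable returning the response body as text, and `extract`
    maps the decoded JSON to a list of records. Both are supplied by the
    caller because the real response layout is not shown in the post.
    """
    records = []
    for page in range(1, pages + 1):
        # Hypothetical form fields: page number and search keyword.
        form = {"pn": page, "kd": keyword}
        body = post(url, headers=headers, data=form)
        records.extend(extract(json.loads(body)))
    return records
```

With requests, post could be lambda url, **kw: requests.post(url, **kw).text, and extract would be written after inspecting one real response in the browser's developer tools.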

Data visualization

City map

Urban distribution heat map

Experience salary chart

Here, looking at the quartiles and median in the box plot, you can roughly see that salary rises with years of experience.

On BOSS Zhipin, the salary for less than 1 year of experience has a maximum above 40,000, which is clearly unreasonable.

So I checked the database: the posting actually requires more than 3 years of experience, but it is labeled as less than 1 year.

So the accuracy of the information provided by the data source matters a great deal.
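The numbers a box plot summarizes can also be computed directly, which makes suspicious entries like the one above easy to flag. The sketch below is an illustration under assumptions: the function names are mine, the quartiles use the simple median-of-halves method (plotting libraries may interpolate differently), the 1.5 × IQR rule is the conventional box-plot whisker rule, and the sample salaries in the usage note are made up. It needs at least two values.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    # For an odd count the middle value is left out of both halves.
    lower, upper = data[:half], data[-half:]
    return median(lower), median(data), median(upper)


def outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]
```

For a made-up group of monthly salaries in thousands, [6, 7, 8, 8, 9, 10, 40], quartiles gives (7, 8, 10) and outliers returns [40], the kind of record worth checking against the database.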

Education salary chart

Overall, master's > bachelor's > junior college, although some junior college and bachelor's postings also pay high salaries.

After all, ability matters more and more later in a career, while a degree remains an important plus.

Company status salary chart

Company size salary chart

Normally, the bigger the company, the higher the salary.

After all, the pay at the big tech companies is out in the open, so it is hard not to know about it.

TOP10 companies by type

Data analysis positions are mainly concentrated in the Internet industry, with “finance”, “real estate”, “education”, “healthcare” and “games” also represented.
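A ranking like this comes down to counting how often each company type appears. A minimal sketch, assuming the company_type values have already been read from the database into a list (the function name top_types and the sample values in the usage note are hypothetical):

```python
from collections import Counter


def top_types(company_types, n=10):
    """Return the n most common company types with their counts."""
    # Skip empty values and ignore surrounding whitespace.
    cleaned = (t.strip() for t in company_types if t and t.strip())
    return Counter(cleaned).most_common(n)
```

For example, top_types(["Internet", "Finance", "Internet"], n=2) returns [("Internet", 2), ("Finance", 1)]; the resulting pairs can be passed to whichever charting library draws the bar chart.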

Job Skill chart

Job benefits word cloud

Here we can see that most of the emphasis is on “five social insurances and one housing fund”, “many benefits”, “good team atmosphere”, “plenty of room for promotion” and “industry-leading experts”.


