While browsing a recruitment website, I came across the salaries offered for web crawler positions, and they looked really tempting. So today I decided to scrape the listings and analyze them.
First, determine the target website:
https://jobs.51job.com/pachongkaifa
1. Start
Open PyCharm and create a new file, import the required libraries, and add the usual request headers:
# import required packages
import requests
from lxml import etree

# target page
url = "https://jobs.51job.com/pachongkaifa"

# request headers
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "Cookie": "guid=7e8a970a750a4e74ce237e74ba72856b; partner=blog_csdn_net",
    "Host": "jobs.51job.com",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36"
}
2. Analyze the tags of the target page. The fields we want (job title, company name, city, and salary) all sit inside p tags like this one:
<p class="info">
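Judging from the XPath expressions used in the parsing code below, each job card presumably looks roughly like this (a sketch inferred from those XPaths, not copied from the site):

<!-- hypothetical markup inferred from the XPath queries below -->
<div>
  <p class="info">
    <span class="title"><a>job title</a></span>
    <a title="company name">...</a>
    <span class="location name">city</span>
    <span class="location">salary</span>
    <span class="time">posting date</span>
  </p>
</div>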
3. Start coding
Request the page first. To prevent garbled Chinese characters, set the response encoding to GBK (without this the text comes back garbled):
res = requests.get(url=url, headers=headers)
res.encoding = 'gbk'
s = res.text
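If you would rather not hard-code the charset, requests can also guess it from the response body; a small alternative (my addition, not from the original post):

# alternative: let requests detect the page's charset instead of hard-coding 'gbk'
res.encoding = res.apparent_encoding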
Then parse the page and extract the fields we want:
selector = etree.HTML(s)
for item in selector.xpath('/html/body/div[4]/div[2]/div[1]/div/div'):
    title = item.xpath('.//p/span[@class="title"]/a/text()')
    name = item.xpath('.//p/a/@title')
    location_name = item.xpath('.//p/span[@class="location name"]/text()')
    sary = item.xpath('.//p/span[@class="location"]/text()')
    time = item.xpath('.//p/span[@class="time"]/text()')
    if len(title) > 0:
        print(title)
        print(name)
        print(location_name)
        print(sary)
        print(time)
        print("-----------")
Running this prints the extracted fields for each posting.
4. Save the data to a CSV file
To make the data analysis in the next step easier, I store the extracted data in a CSV file.
Import the required libraries:

import csv
import codecs
Create the CSV file, open it in append mode, and write the header row:
f = codecs.open('crawler_engineer_salary.csv', 'a', 'gbk')
writer = csv.writer(f)
writer.writerow(["position", "company", "city", "salary"])
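A note on this choice: codecs.open is a Python 2 habit; in Python 3 the built-in open() is the usual way, and newline='' stops the csv module from writing blank rows on Windows. An equivalent alternative (my addition, not in the original post):

# alternative: built-in open(); newline='' avoids blank rows on Windows
f = open('crawler_engineer_salary.csv', 'a', encoding='gbk', newline='')
writer = csv.writer(f)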
Inside the scraping loop, write each record to the CSV as it is extracted:

writer.writerow([title[0], name[0], location_name[0], sary[0]])
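Putting the pieces together, the whole scrape-and-save loop looks like this (a sketch assembled from the snippets above; nothing new except closing the file at the end):

selector = etree.HTML(s)
for item in selector.xpath('/html/body/div[4]/div[2]/div[1]/div/div'):
    title = item.xpath('.//p/span[@class="title"]/a/text()')
    name = item.xpath('.//p/a/@title')
    location_name = item.xpath('.//p/span[@class="location name"]/text()')
    sary = item.xpath('.//p/span[@class="location"]/text()')
    if len(title) > 0:
        # one CSV row per job posting: position, company, city, salary
        writer.writerow([title[0], name[0], location_name[0], sary[0]])
f.close()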
The saved CSV now contains one row per posting.
5. Analyze the data and visualize it
Three lists hold the fields used in the analysis (job title, city, salary); they must be defined before the reading loop:

# jobs
title_list = []
# city
city_list = []
# salary distribution
sary_list = []

Salaries on the site are quoted with the Chinese units 万 (10,000) and 千 (1,000) yuan per month, so the upper bound of each salary range has to be converted into a plain number before it can be analyzed. Read the crawled data back from the CSV and do the conversion:

# used by the analysis and charts below
import operator
import matplotlib.pyplot as plt

with open('crawler_engineer_salary.csv', 'r', encoding='gbk') as fp:
    reader = csv.reader(fp)
    for row in reader:
        # job title
        title_list.append(row[0])
        # city (first two characters)
        city_list.append(row[2][0:2])
        # salary: keep the upper bound of the range
        sary = row[3].split("-")
        if len(sary) == 2:
            try:
                sary = sary[1].replace("/月", "")  # strip the "per month" suffix
                if "万" in sary:      # 万 = 10,000
                    sary = int(sary.replace("万", "")) * 10000
                    sary_list.append(sary)
                elif "千" in sary:    # 千 = 1,000
                    sary = int(sary.replace("千", "")) * 1000
                    sary_list.append(sary)
            except ValueError:
                pass
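Note that int() silently drops rows with decimal upper bounds such as "1.5万". The conversion can be packaged into a small helper that also handles those (a sketch of my own, not from the original post):

# hypothetical helper, not part of the original code
def parse_salary(text):
    """Convert strings like '1.5-2万/月' to the upper bound in yuan, or None."""
    parts = text.split("-")
    if len(parts) != 2:
        return None
    upper = parts[1].replace("/月", "")
    try:
        if "万" in upper:    # 万 = 10,000
            return int(float(upper.replace("万", "")) * 10000)
        if "千" in upper:    # 千 = 1,000
            return int(float(upper.replace("千", "")) * 1000)
    except ValueError:
        pass
    return None

print(parse_salary("1.5-2万/月"))  # 20000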
Start analyzing
5.1. Visualization 1: common crawler job titles
dict_x = {}
for item in title_list:
    dict_x[item] = title_list.count(item)
sorted_x = sorted(dict_x.items(), key=operator.itemgetter(1), reverse=True)
k_list = []
v_list = []
for k, v in sorted_x[0:11]:
    k_list.append(k)
    v_list.append(v)
plt.axes(aspect=1)
plt.pie(x=v_list, labels=k_list, autopct='%.0f%%')
plt.savefig("common_crawler_job_titles.png", dpi=600)
plt.show()
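By the way, counting with list.count inside a loop is O(n²); collections.Counter from the standard library gives the same top-k in a single pass (an equivalent alternative, not in the original post):

from collections import Counter

# same data as the pie chart above: the 11 most common titles with counts
for k, v in Counter(title_list).most_common(11):
    print(k, v)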
As you can see, most companies advertise the position under the title "crawler developer".
5.2. Visualization 2: Cities with the most crawler jobs
dict_x = {}
for item in city_list:
    dict_x[item] = city_list.count(item)
sorted_x = sorted(dict_x.items(), key=operator.itemgetter(1), reverse=True)
k_list = []
v_list = []
for k, v in sorted_x[0:11]:
    print(k, v)
    k_list.append(k)
    v_list.append(v)
plt.bar(k_list, v_list, label='crawler jobs')
plt.legend()
plt.xlabel('city')
plt.ylabel('number')
plt.title(u'Cities with the most crawler jobs (Li Yuchen)')
plt.savefig("cities_with_most_crawler_jobs.png", dpi=600)
plt.show()
As the chart shows, crawler engineer jobs are concentrated in the big cities (Beijing, Shanghai, Guangzhou, and Shenzhen).
5.3. Visualization 3: Salary distribution
dict_x = {}
for item in sary_list:
    dict_x[item] = sary_list.count(item)
sorted_x = sorted(dict_x.items(), key=operator.itemgetter(1), reverse=True)
k_list = []
v_list = []
for k, v in sorted_x[0:15]:
    print(k, v)
    k_list.append(k)
    v_list.append(v)
plt.axes(aspect=1)
plt.title(u'salary distribution')
plt.pie(x=v_list, labels=k_list, autopct='%.0f%%')
plt.savefig("salary_distribution.png", dpi=600)
plt.show()
We can see that salaries above 20,000 yuan/month account for roughly half of the postings, with a big cluster around 20,000. Crawler jobs really do pay well; jealous yet? Haha.
To quantify this, group the salaries into bins with pandas and compute the frequency table:

import pandas as pd

data = pd.DataFrame({"value": sary_list})
cats1 = pd.cut(data['value'].values, bins=[8000, 10000, 20000, 30000, 50000, data['value'].max()+1])
pinshu = cats1.value_counts()
pinshu_df = pd.DataFrame({'frequency': pinshu})
pinshu_df['frequency f'] = pinshu_df['frequency'] / pinshu_df['frequency'].sum()
pinshu_df['frequency %'] = pinshu_df['frequency f'].map(lambda x: '%.2f%%' % (x * 100))
pinshu_df['cumulative frequency f'] = pinshu_df['frequency f'].cumsum()
pinshu_df['cumulative frequency %'] = pinshu_df['cumulative frequency f'].map(lambda x: '%.4f%%' % (x * 100))
print(pinshu_df)
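To see what pd.cut is doing, here is a tiny check on hand-made numbers (a hypothetical sample, purely for illustration):

# hypothetical sample: one salary per bin
demo = pd.Series([9000, 15000, 25000, 40000, 60000])
print(pd.cut(demo, bins=[8000, 10000, 20000, 30000, 50000, demo.max() + 1]).value_counts())
# each interval (8000, 10000], (10000, 20000], ... appears exactly once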
Looking at the ranges, most salaries fall between 10,000 and 20,000 yuan/month, which is already a very good salary, and quite a few exceed 20,000. The temptation is real.
OK, that's all for today's sharing. See you next time!