While browsing a recruitment website, I came across the salaries offered for web crawler positions, and they looked really tempting. So today I decided to scrape the listings and analyze them.
First, determine the target website:
https://jobs.51job.com/pachongkaifa
1. Start
Open PyCharm and create a new file, import the required libraries, and add the usual request headers:
# import required packages
import requests
from lxml import etree

# target page
url = "https://jobs.51job.com/pachongkaifa"

# request headers
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "Cookie": "guid=7e8a970a750a4e74ce237e74ba72856b; partner=blog_csdn_net",
    "Host": "jobs.51job.com",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36"
}
2. Analyze the tags of the target page. The fields we want (job title, company name, city, and salary) all sit inside p tags like this one:
<p class="info">
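Judging from the XPath expressions used in the parsing code below, each job card presumably looks roughly like this (a sketch inferred from those XPaths, not copied from the site):

<!-- hypothetical markup inferred from the XPath queries below -->
<div>
  <p class="info">
    <span class="title"><a>job title</a></span>
    <a title="company name">...</a>
    <span class="location name">city</span>
    <span class="location">salary</span>
    <span class="time">posting date</span>
  </p>
</div>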
3. Start coding
Request the page first. To prevent garbled Chinese characters, set the response encoding to GBK (without this the text comes back garbled):
res = requests.get(url=url, headers=headers)
res.encoding = 'gbk'
s = res.text
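If you would rather not hard-code the charset, requests can also guess it from the response body; a small alternative (my addition, not from the original post):

# alternative: let requests detect the page's charset instead of hard-coding 'gbk'
res.encoding = res.apparent_encoding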
Then parse the page and extract the fields we want:
selector = etree.HTML(s)
for item in selector.xpath('/html/body/div[4]/div[2]/div[1]/div/div'):
    title = item.xpath('.//p/span[@class="title"]/a/text()')
    name = item.xpath('.//p/a/@title')
    location_name = item.xpath('.//p/span[@class="location name"]/text()')
    sary = item.xpath('.//p/span[@class="location"]/text()')
    time = item.xpath('.//p/span[@class="time"]/text()')
    if len(title) > 0:
        print(title)
        print(name)
        print(location_name)
        print(sary)
        print(time)
        print("-----------")
Running this prints the extracted fields for each posting.
4. Save the data to a CSV file
To make the data analysis in the next step easier, I store the extracted data in a CSV file.
Import the required libraries:

import csv
import codecs
Create the CSV file, open it in append mode, and write the header row:
f = codecs.open('crawler_engineer_salary.csv', 'a', 'gbk')
writer = csv.writer(f)
writer.writerow(["position", "company", "city", "salary"])
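A note on this choice: codecs.open is a Python 2 habit; in Python 3 the built-in open() is the usual way, and newline='' stops the csv module from writing blank rows on Windows. An equivalent alternative (my addition, not in the original post):

# alternative: built-in open(); newline='' avoids blank rows on Windows
f = open('crawler_engineer_salary.csv', 'a', encoding='gbk', newline='')
writer = csv.writer(f)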
Inside the scraping loop, write each record to the CSV as it is extracted:

writer.writerow([title[0], name[0], location_name[0], sary[0]])
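Putting the pieces together, the whole scrape-and-save loop looks like this (a sketch assembled from the snippets above; nothing new except closing the file at the end):

selector = etree.HTML(s)
for item in selector.xpath('/html/body/div[4]/div[2]/div[1]/div/div'):
    title = item.xpath('.//p/span[@class="title"]/a/text()')
    name = item.xpath('.//p/a/@title')
    location_name = item.xpath('.//p/span[@class="location name"]/text()')
    sary = item.xpath('.//p/span[@class="location"]/text()')
    if len(title) > 0:
        # one CSV row per job posting: position, company, city, salary
        writer.writerow([title[0], name[0], location_name[0], sary[0]])
f.close()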
The saved CSV now contains one row per posting.
5. Analyze the data and visualize it
Three lists hold the fields used in the analysis (job title, city, salary); they must be defined before the reading loop:

# jobs
title_list = []
# city
city_list = []
# salary distribution
sary_list = []

Salaries on the site are quoted with the Chinese units 万 (10,000) and 千 (1,000) yuan per month, so the upper bound of each salary range has to be converted into a plain number before it can be analyzed. Read the crawled data back from the CSV and do the conversion:

# used by the analysis and charts below
import operator
import matplotlib.pyplot as plt

with open('crawler_engineer_salary.csv', 'r', encoding='gbk') as fp:
    reader = csv.reader(fp)
    for row in reader:
        # job title
        title_list.append(row[0])
        # city (first two characters)
        city_list.append(row[2][0:2])
        # salary: keep the upper bound of the range
        sary = row[3].split("-")
        if len(sary) == 2:
            try:
                sary = sary[1].replace("/月", "")  # strip the "per month" suffix
                if "万" in sary:      # 万 = 10,000
                    sary = int(sary.replace("万", "")) * 10000
                    sary_list.append(sary)
                elif "千" in sary:    # 千 = 1,000
                    sary = int(sary.replace("千", "")) * 1000
                    sary_list.append(sary)
            except ValueError:
                pass
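Note that int() silently drops rows with decimal upper bounds such as "1.5万". The conversion can be packaged into a small helper that also handles those (a sketch of my own, not from the original post):

# hypothetical helper, not part of the original code
def parse_salary(text):
    """Convert strings like '1.5-2万/月' to the upper bound in yuan, or None."""
    parts = text.split("-")
    if len(parts) != 2:
        return None
    upper = parts[1].replace("/月", "")
    try:
        if "万" in upper:    # 万 = 10,000
            return int(float(upper.replace("万", "")) * 10000)
        if "千" in upper:    # 千 = 1,000
            return int(float(upper.replace("千", "")) * 1000)
    except ValueError:
        pass
    return None

print(parse_salary("1.5-2万/月"))  # 20000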
Start analyzing
5.1. Visualization 1: common crawler job titles
dict_x = {}
for item in title_list:
    dict_x[item] = title_list.count(item)
sorted_x = sorted(dict_x.items(), key=operator.itemgetter(1), reverse=True)
k_list = []
v_list = []
for k, v in sorted_x[0:11]:
    k_list.append(k)
    v_list.append(v)
plt.axes(aspect=1)
plt.pie(x=v_list, labels=k_list, autopct='%.0f%%')
plt.savefig("common_crawler_job_titles.png", dpi=600)
plt.show()
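By the way, counting with list.count inside a loop is O(n²); collections.Counter from the standard library gives the same top-k in a single pass (an equivalent alternative, not in the original post):

from collections import Counter

# same data as the pie chart above: the 11 most common titles with counts
for k, v in Counter(title_list).most_common(11):
    print(k, v)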
As you can see, most companies advertise the position under the title "crawler developer".
5.2. Visualization 2: Cities with the most crawler jobs
dict_x = {}
for item in city_list:
    dict_x[item] = city_list.count(item)
sorted_x = sorted(dict_x.items(), key=operator.itemgetter(1), reverse=True)
k_list = []
v_list = []
for k, v in sorted_x[0:11]:
    print(k, v)
    k_list.append(k)
    v_list.append(v)
plt.bar(k_list, v_list, label='crawler jobs')
plt.legend()
plt.xlabel('city')
plt.ylabel('number')
plt.title(u'Cities with the most crawler jobs (Li Yuchen)')
plt.savefig("cities_with_most_crawler_jobs.png", dpi=600)
plt.show()
As the chart shows, crawler engineer jobs are concentrated in the big cities (Beijing, Shanghai, Guangzhou, and Shenzhen).
5.3. Visualization 3: Salary distribution
dict_x = {}
for item in sary_list:
    dict_x[item] = sary_list.count(item)
sorted_x = sorted(dict_x.items(), key=operator.itemgetter(1), reverse=True)
k_list = []
v_list = []
for k, v in sorted_x[0:15]:
    print(k, v)
    k_list.append(k)
    v_list.append(v)
plt.axes(aspect=1)
plt.title(u'salary distribution')
plt.pie(x=v_list, labels=k_list, autopct='%.0f%%')
plt.savefig("salary_distribution.png", dpi=600)
plt.show()
We can see that salaries above 20,000 yuan/month account for roughly half of the postings, with a big cluster around 20,000. Crawler jobs really do pay well; jealous yet? Haha.
To quantify this, group the salaries into bins with pandas and compute the frequency table:

import pandas as pd

data = pd.DataFrame({"value": sary_list})
cats1 = pd.cut(data['value'].values, bins=[8000, 10000, 20000, 30000, 50000, data['value'].max()+1])
pinshu = cats1.value_counts()
pinshu_df = pd.DataFrame({'frequency': pinshu})
pinshu_df['frequency f'] = pinshu_df['frequency'] / pinshu_df['frequency'].sum()
pinshu_df['frequency %'] = pinshu_df['frequency f'].map(lambda x: '%.2f%%' % (x * 100))
pinshu_df['cumulative frequency f'] = pinshu_df['frequency f'].cumsum()
pinshu_df['cumulative frequency %'] = pinshu_df['cumulative frequency f'].map(lambda x: '%.4f%%' % (x * 100))
print(pinshu_df)
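To see what pd.cut is doing, here is a tiny check on hand-made numbers (a hypothetical sample, purely for illustration):

# hypothetical sample: one salary per bin
demo = pd.Series([9000, 15000, 25000, 40000, 60000])
print(pd.cut(demo, bins=[8000, 10000, 20000, 30000, 50000, demo.max() + 1]).value_counts())
# each interval (8000, 10000], (10000, 20000], ... appears exactly once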
Looking at the ranges, most salaries fall between 10,000 and 20,000 yuan/month, which is already a very good salary, and quite a few exceed 20,000. The temptation is real.
OK, that's all for today's sharing. See you next time!