Using Python to crawl 13,966 operations job postings, what did I get?

Author: JackTian, WeChat official account ID: Jake_Internet

WeChat official account: Jie Ge’s IT journey. Reply “operation and maintenance” in the account's chat window to get the complete data for this article.

Hello, I’m JackTian.

Readers often ask me about operations work: Jack, what exactly does an operations engineer do? How are the pay and benefits? Jie Ge, could you look at the requirements of this posting and tell me whether a beginner could handle it? And so on.

I previously wrote an article, “Take you through the learning route from junior operations engineer to senior operations specialist in one article”. It summarizes the skills needed at each stage, from junior through intermediate to senior operations engineer and on to more advanced directions. It is only a reference for planning your learning path; if you have anything to add, leave a comment on this article to join the discussion.

This time, out of curiosity and based on my own work experience, I made a preliminary analysis of how the industry recruits operations engineers. A good friend of mine, Huang Wei, helped me crawl 13,966 operations job postings so we could see how the data differs across several dimensions. The main contents are:

Top10 employers in hot industries
Top10 in number of jobs in popular cities
Provincial distribution of posts
The employment situation of different company sizes
The average salary of the top 10 jobs
The educational requirements of the position
Word cloud distribution of demand for operation and maintenance positions

The article is organized into the following three steps:

The crawler parts
Data cleaning
Data visualization and analysis

1. Crawler

This article crawls data from 51Job. XPath is used for page parsing, Pandas for data cleaning, and Pyecharts for visualization.

Comments are included in the code. For readability, only part of the code is shown here; see the end of the article for how to get the complete code.

    # 1. Job name
    job_name = dom.xpath('//div[@class="dw_table"]/div[@class="el"]//p/span/a[@target="_blank"]/@title')
    # 2. Company name
    company_name = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t2"]/a[@target="_blank"]/@title')
    # 3. Work place
    address = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t3"]/text()')
    # 4. Salary
    salary_mid = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t4"]')
    salary = [i.text for i in salary_mid]
    # 5. Release date
    release_time = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t5"]/text()')
    # 6. URL of the detail (second-level) page
    deep_url = dom.xpath('//div[@class="dw_table"]/div[@class="el"]//p/span/a[@target="_blank"]/@href')
    # 7. Experience and education: grabbed together as one field named random_all and cleaned later.
    #    The end of this XPath is truncated in the source, so the line is left as found:
    # random_all = dom_test.xpath('//div[@class="tHeader tHjob"]//div[@class="cn"]/p[@class="msg ...
    # 8. Job description (the variable name is missing in the source; job_describe is a placeholder)
    job_describe = dom_test.xpath('//div[@class="tBorderTop_box"]//div[@class="bmsg job_msg inbox"]/p/text()')
    # 9. Company type
    company_type = dom_test.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[1]/@title')
    # 10. Company size
    company_size = dom_test.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[2]/@title')
    # 11. Industry (the variable name is missing in the source; industry is a placeholder)
    industry = dom_test.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[3]/@title')
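To make the extraction step above more concrete, here is a minimal sketch that runs the same kind of path queries against a tiny hand-written fragment. It is not the article's crawler: it swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath, and the sample markup, the function name parse_listing, and all values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment that imitates the listing layout (not real 51Job markup).
SAMPLE = """
<html><body>
<div class="dw_table">
  <div class="el">
    <p><span><a target="_blank" title="Linux Ops Engineer" href="https://example.com/job/1">Linux Ops Engineer</a></span></p>
    <span class="t2"><a target="_blank" title="Example Co.">Example Co.</a></span>
    <span class="t3">Shanghai</span>
    <span class="t4">1-1.5万/月</span>
    <span class="t5">05-20</span>
  </div>
</div>
</body></html>
"""


def parse_listing(html_text):
    """Return one dict per job row found in the listing table."""
    root = ET.fromstring(html_text.strip())
    rows = []
    for el in root.findall('.//div[@class="dw_table"]/div[@class="el"]'):
        link = el.find('./p/span/a[@target="_blank"]')
        company = el.find('./span[@class="t2"]/a')
        rows.append({
            "job_name": link.get("title"),
            "deep_url": link.get("href"),
            "company_name": company.get("title"),
            "address": el.find('./span[@class="t3"]').text,
            "salary": el.find('./span[@class="t4"]').text,
            "release_time": el.find('./span[@class="t5"]').text,
        })
    return rows


rows = parse_listing(SAMPLE)
print(rows[0]["job_name"], rows[0]["salary"])
```

Real pages are rarely well-formed XML, which is one reason a forgiving HTML parser is normally used for crawling instead.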

2. Data cleaning

1) Read the data

    # Libraries used in the steps below
    import pandas as pd
    import numpy as np
    import re
    import jieba  # appears as "import ba" in the source; jieba is assumed

    df = pd.read_csv("only_yun_wei.csv", encoding="gbk", header=None)
    df.head()

2) Set new row and column indexes for the data

    # The CSV was read without a header, so assign the row index and column names
    df.index = range(len(df))
    df.columns = ["job name", "company name", "work place", "salary", "release date",
                  "experience and education", "company type", "company size",
                  "industry", "job description"]
    df.head()

3) Deduplication

    print("Records before deduplication:", df.shape)
    # A posting is a duplicate if company, job name and work place all match
    df.drop_duplicates(subset=["company name", "job name", "work place"], inplace=True)
    print("Records after deduplication:", df.shape)
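As a small illustration of the deduplication step, the sketch below applies drop_duplicates to a three-row toy table. The rows are invented; only the three key columns matter for deciding what counts as a duplicate, and the first occurrence is the one kept.

```python
import pandas as pd

# Toy data (made up): the first two rows share company, job name and work place.
df = pd.DataFrame({
    "company name": ["A", "A", "B"],
    "job name": ["ops engineer", "ops engineer", "ops engineer"],
    "work place": ["Beijing", "Beijing", "Beijing"],
    "salary": ["1-1.5万/月", "1-2万/月", "6-8千/月"],
})

deduped = df.drop_duplicates(
    subset=["company name", "job name", "work place"]
).reset_index(drop=True)

print(len(df), "->", len(deduped))
```

Note that the second row is dropped even though its salary differs, because salary is not part of the subset.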

4) Processing of post name field

    df["job name"].value_counts()
    df["job name"] = df["job name"].apply(lambda x: x.lower())

    # Keep only postings whose title contains one of the target keywords.
    # (The keywords are shown in English here; the scraped titles are Chinese.)
    df.shape
    target_job = ['operations', 'Linux ops', 'operational development', 'enterprise',
                  'application operations', 'system operations', 'database operations',
                  'operational security', 'network operations', 'desktop operations']
    index = [df["job name"].str.count(i) for i in target_job]
    index = np.array(index).sum(axis=0) > 0
    job_info = df[index].copy()
    job_info.shape

    # Collapse each title to the first matching standard name
    job_list = ['Linux ops', 'operational development', 'enterprise', 'application operations',
                'system operations', 'database operations', 'operational security',
                'network operations', 'desktop operations', 'it ops', 'software operations',
                'operations engineers']
    job_list = np.array(job_list)

    def rename(x=None, job_list=job_list):
        index = [i in x for i in job_list]
        if sum(index) > 0:
            return job_list[index][0]
        else:
            return x

    job_info["job name"] = job_info["job name"].apply(rename)
    job_info["job name"].value_counts()[:10]
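The renaming idea above (replace a free-form title with the first standard keyword it contains) can be sketched in plain Python. The keyword list and the sample titles below are hypothetical English stand-ins, not the article's data.

```python
# Hypothetical keyword list; order matters because the first match wins.
JOB_KEYWORDS = ["linux ops", "ops development", "database ops", "network ops"]


def normalize_title(title, keywords=JOB_KEYWORDS):
    """Lower-case the title and collapse it to the first keyword it contains."""
    title = title.lower()
    for kw in keywords:
        if kw in title:
            return kw
    return title  # no keyword found: keep the lower-cased title


titles = ["Senior Linux Ops Engineer (Beijing)", "Ops Development Lead", "DBA"]
print([normalize_title(t) for t in titles])
```

Because the first match wins, more specific keywords should come before more general ones in the list.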

5) Processing of salary field

    # Inspect the last character (per year / per month) and the third-from-last (unit)
    job_info["salary"].str[-1].value_counts()
    job_info["salary"].str[-3].value_counts()

    # Keep only rows such as "1-1.5万/月": 年 = year, 月 = month, 万 = 10,000, 千 = 1,000
    index1 = job_info["salary"].str[-1].isin(["年", "月"])
    index2 = job_info["salary"].str[-3].isin(["万", "千"])
    job_info = job_info[index1 & index2]
    job_info["salary"].str[-3:].value_counts()

    def get_money_max_min(x):
        try:
            if x[-3] == "万":
                z = [float(i) * 10000 for i in re.findall(r"[0-9]+\.?[0-9]*", x)]
            elif x[-3] == "千":
                z = [float(i) * 1000 for i in re.findall(r"[0-9]+\.?[0-9]*", x)]
            if x[-1] == "年":
                z = [i / 12 for i in z]  # convert yearly pay to monthly
            return z
        except Exception:
            # This branch is garbled in the source; returning x is a reconstruction
            return x

    salary = job_info["salary"].apply(get_money_max_min)
    job_info["minimum salary"] = salary.str[0]
    job_info["maximum salary"] = salary.str[1]
    job_info["salary level"] = job_info[["minimum salary", "maximum salary"]].mean(axis=1)
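A compact way to see what the salary step does is to parse a few strings by hand. The sketch below is an alternative formulation using one regular expression; it assumes salary strings of the form "low-high" followed by 万 (10,000) or 千 (1,000), a slash, and 年 (year) or 月 (month). The function name monthly_range and the sample strings are illustrative.

```python
import re

UNIT = {"万": 10000, "千": 1000}
PATTERN = re.compile(r"(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)(万|千)/(年|月)")


def monthly_range(text):
    """Return (low, high) per month, or None if the format is not recognised."""
    m = PATTERN.fullmatch(text)
    if m is None:
        return None
    low, high = float(m.group(1)), float(m.group(2))
    scale = UNIT[m.group(3)]
    months = 12 if m.group(4) == "年" else 1  # yearly pay is spread over 12 months
    return low * scale / months, high * scale / months


for s in ["1-1.5万/月", "12-24万/年", "6-8千/月", "negotiable"]:
    print(s, monthly_range(s))
```

Averaging the two values of each pair gives the per-posting salary level used in the later comparison.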

6) Processing of work place field

    # City names are shown in English here; the scraped data uses the Chinese names.
    # Two entries are unrecoverable in the source and the list is cut off after 'zhongshan'.
    address_list = ['Beijing', 'Shanghai', 'guangzhou', 'shenzhen', 'hangzhou', 'suzhou',
                    'changsha', 'wuhan', 'tianjin', 'chengdu', "xi'an", 'dongguan', 'hefei',
                    'foshan', 'ningbo', 'nanjing', 'chongqing', 'Changzhou', 'zhengzhou',
                    'changchun', 'fuzhou', 'shenyang', 'jinan', 'xiamen', 'guizhou',
                    'zhuhai', 'Qingdao', 'zhongshan']
    address_list = np.array(address_list)

    # Collapse values such as "city-district" to the first matching city name
    def rename(x=None, address_list=address_list):
        index = [i in x for i in address_list]
        if sum(index) > 0:
            return address_list[index][0]
        else:
            return x

    job_info["work place"] = job_info["work place"].apply(rename)
    job_info["work place"].value_counts()

7) Processing of company type field

    # Values shorter than 6 characters (such as "[]") carry no information
    job_info.loc[job_info["company type"].apply(lambda x: len(x) < 6), "company type"] = np.nan
    # Strip the leading "['" and trailing "']"
    job_info["company type"] = job_info["company type"].str[2:-2]
    job_info["company type"].value_counts()

8) Processing of industry field

    job_info["industry"] = job_info["industry"].apply(lambda x: re.sub(",", "/", x))
    job_info.loc[job_info["industry"].apply(lambda x: len(x) < 6), "industry"] = np.nan
    # Strip the list wrapper and keep only the first industry
    job_info["industry"] = job_info["industry"].str[2:-2].str.split("/").str[0]
    job_info["industry"].value_counts()
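The [2:-2] slice in the company type step only makes sense if each cell holds the text form of a one-element list, such as "['Private company']"; that is an assumption read off the code, not something the article states. Under that assumption, a slightly more defensive alternative is to parse the cell with ast.literal_eval, as in this sketch (the helper name first_item and the sample cells are made up):

```python
import ast


def first_item(cell):
    """Parse a cell like "['Private company']" and return its first element, else None."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return None
    if isinstance(value, list) and value:
        return value[0]
    return None


cell = "['Private company']"
print(cell[2:-2])        # the slicing approach
print(first_item(cell))  # the parsing approach
print(first_item("[]"))  # empty list: no value
```

Both approaches give the same result on clean cells; the parsing version simply returns None for empty or malformed cells instead of relying on a length check.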

9) Processing of experience and education fields

    # Education keywords: 本科 = bachelor, 大专 = junior college, 应届生 = fresh graduate,
    # 在校生 = current student, 硕士 = master, 博士 = doctorate
    job_info["education"] = job_info["experience and education"].apply(
        lambda x: re.findall("本科|大专|应届生|在校生|硕士|博士", x))

    def func(x):
        if len(x) == 0:
            return np.nan
        elif len(x) == 1 or len(x) == 2:
            return x[0]
        else:
            return x[2]

    job_info["education"] = job_info["education"].apply(func)
    job_info["education"].value_counts()

10) Processing of company size field

    # Map the raw size labels (人 = people, 少于 = fewer than, 以上 = or more) to short ranges
    def func(x):
        if x == "['少于50人']":
            return "<50"
        elif x == "['50-150人']":
            return "50-150"
        elif x == "['150-500人']":
            return "150-500"
        elif x == "['500-1000人']":
            return "500-1000"
        elif x == "['1000-5000人']":
            return "1000-5000"
        elif x == "['5000-10000人']":
            return "5000-10000"
        elif x == "['10000人以上']":
            return ">10000"
        else:
            return np.nan

    job_info["company size"] = job_info["company size"].apply(func)
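The chain of elif branches above is a lookup table in disguise. A possible alternative, sketched below, keeps the mapping in a dictionary. The raw labels are assumed to be the Chinese size labels wrapped in list syntax, and the names SIZE_LABELS and size_bucket are invented for this example.

```python
# Assumed raw labels (人 = people) mapped to the short ranges used in the charts.
SIZE_LABELS = {
    "少于50人": "<50",
    "50-150人": "50-150",
    "150-500人": "150-500",
    "500-1000人": "500-1000",
    "1000-5000人": "1000-5000",
    "5000-10000人": "5000-10000",
    "10000人以上": ">10000",
}


def size_bucket(cell):
    """Strip the list wrapper from a cell like "['50-150人']" and look up its range."""
    label = cell.strip("[]' ")
    return SIZE_LABELS.get(label)  # None when the label is unknown or empty


print(size_bucket("['50-150人']"), size_bucket("['10000人以上']"), size_bucket("[]"))
```

Adding or correcting a range then means editing one dictionary entry rather than a branch.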

11) Construct new data and export the processed data into new Excel

    feature = ["company name", "job name", "work place", "salary level", "release date",
               "education", "company type", "company size", "industry", "job description"]
    final_df = job_info[feature]
    # The encoding argument from the original call is omitted: to_excel no longer accepts it in pandas 2.0+
    final_df.to_excel(r"visuals.xlsx", index=False)

3. Data visualization

1) Visualization of large screen effect

2) Top10 employers in hot industries

Judging by industry, computer software, computer services, the Internet, and communications have higher hiring demand than other industries.

3) Top10 in number of jobs in popular cities

Among popular cities, the first-tier cities of Beijing, Shanghai, Guangzhou, and Shenzhen account for a large share of the postings. Judging from the data and my past experience, postings in other cities tend to come from outsourcing companies.

4) Provincial distribution of posts

On the color bar at the far left, darker colors mean more postings and lighter colors mean fewer. On the map, Guangdong, Jiangsu, Shanghai, and Beijing are the darkest, so postings are more concentrated there than in other provinces.

5) Employment situation of different company sizes

Company size naturally differs across industries. Company size is a classification of companies according to the relevant standards and regulations, generally into extra-large, large, medium, small, and micro. As shown in the figure below, companies with 50-500 employees account for more than 50% of postings and have the highest hiring demand. Companies with 1,000-10,000 employees account for less than 50%, but such companies are already relatively large.
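Shares like the ones quoted above can be computed directly from the cleaned company size column. The following sketch uses a four-value toy Series (invented numbers, not the article's data) and value_counts with normalize=True to turn counts into proportions.

```python
import pandas as pd

# Toy column of cleaned company sizes (made up for illustration).
sizes = pd.Series(["50-150", "150-500", "50-150", "<50"])

# normalize=True returns each category's share of all rows instead of a raw count.
share = sizes.value_counts(normalize=True)

# Combined share of the 50-500 employee range.
mid_share = share["50-150"] + share["150-500"]
print(share)
print(mid_share)
```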

6) The average salary for the top 10 jobs

As I understand it, titles such as system engineer, software implementation engineer, and operations specialist can also be counted as operations work, and each company defines operations roles differently. To make the filtering and analysis more accurate, I removed those titles and kept the following 10: operations development, operations engineer, software operations, network operations, system operations, desktop operations, database operations, application operations, Linux operations, and IT operations. These are basically the titles I have seen most often.

Among the top 10 positions, the average salary exceeds 10,000 (1W) for operations development, application operations, database operations, and Linux operations. Operations development holds the leading position within the operations field.
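The average salary per job title can be obtained with a group-by over the cleaned table. The sketch below uses a made-up five-row table; the column names follow the English names used for the cleaned data in this rewrite and are otherwise arbitrary.

```python
import pandas as pd

# Toy postings (invented): job title and monthly salary level.
jobs = pd.DataFrame({
    "job name": ["ops development", "ops development", "linux ops", "linux ops", "desktop ops"],
    "salary level": [14000.0, 16000.0, 11000.0, 13000.0, 6000.0],
})

# Mean salary per title, highest first, limited to the top 10 titles.
avg = (
    jobs.groupby("job name")["salary level"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
print(avg)
```

The resulting Series (title as index, mean as value) is the shape of data a bar chart needs.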

7) Distribution of educational requirements for operation and maintenance positions

In terms of education, junior college and bachelor's degrees account for the majority; postings aimed at current students, master's degrees, and doctorates are very few. Some of my student readers ask me whether it is easy for a fresh graduate to find an operations job. Personally, I don’t recommend going into operations right after graduation. Operations places high demands on technical skill and work experience, and a new graduate without much practical experience has little advantage. That said, if you are really interested in this kind of position, you can still give it a try.

8) Word cloud distribution of demand for operation and maintenance positions

In the word cloud of requirements for operations positions, the most frequent words are operations, ability, system, maintenance, and experience. This again shows that operations positions place high demands on personal technical ability and previous work experience. Many other related words appear as well; see the figure below for details.

Conclusion

After all of this, I believe you have a preliminary understanding of the operations engineer job market. From this article you can see which industries have high demand for operations staff, which cities recruit the most, how positions are distributed across provinces, how hiring differs by company size, what the average salaries are, what education is required, and which words dominate the word cloud. I hope this analysis helps you make a first judgment about the direction, industry, city, and company size to aim for in operations.
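A word cloud is driven by word frequencies, so the counting step can be sketched on its own. The example below counts words in three invented English job descriptions with collections.Counter. Chinese text has no spaces between words, so the real data would first need a word segmenter; the regular-expression tokenizer and the stopword list here are simplifying assumptions.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"and", "the", "of", "with", "in"})


def top_words(descriptions, n=5, stopwords=STOPWORDS):
    """Count lower-cased alphabetic words across all descriptions, skipping stopwords."""
    counts = Counter()
    for text in descriptions:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)


# Made-up descriptions for illustration.
descriptions = [
    "Linux system maintenance experience",
    "System monitoring and maintenance",
    "Experience with Linux system",
]
print(top_words(descriptions))
```

The (word, count) pairs returned here are the kind of input a word cloud renderer typically takes.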

For the skills that operations positions require, see “Take you through the learning route from junior operations engineer to senior operations specialist in one article”. If you have other operations-related questions, leave a comment on this article to join the discussion. I also go through the comments to find the questions most readers share and decide whether to write follow-up articles on them, so feel free to leave a comment.

Original writing is not easy. If you found this article useful, please leave a comment, like it, or share it so that more operations engineers can see it. That is my strongest motivation to keep producing high-quality articles. Thank you!
