I have recently been working on a project to analyze Python jobs. The background is that although I have been using Python for a long time, I have never had a comprehensive picture of Python-related jobs. So I want to use this analysis to find out what Python-related jobs exist, how demand varies across cities, what the salaries are, and what the experience requirements are. The analysis covers the following stages:

  • Data collection

  • Data cleaning

  1. Abnormal creation times
  2. Abnormal salaries
  3. Abnormal work experience
  • Statistical analysis
  1. Overall data
  2. Single-dimension analysis
  3. Two-dimensional cross analysis
  4. Multidimensional drill-down
  • Text analysis
  1. Text preprocessing
  2. Word cloud
  3. FP-growth association analysis
  4. LDA topic model analysis

The write-up is divided into two parts: the first covers the first three stages, and the second focuses on text analysis.

0. Data collection

In most cases, data analysis is done on the company's own business data, so data collection is not a concern. However, data exploration done in our spare time usually requires collecting the data ourselves, and the most common collection technique is a web crawler.

The data used here was crawled from the Lagou job site. The collection work falls into three parts: working out how to capture the data, writing a crawler to capture it, and formatting the captured data and saving it to MongoDB. Data collection was covered in a separate article with open-source code, so I won't go over it again here; if you want the details, check out the earlier article on crawling job postings with Python.

1. Data cleaning

Once you have the data, hold your horses. We need to have a general understanding of the data and eliminate some abnormal records in the process to prevent them from affecting the subsequent statistical results.

For example, if there are 101 jobs, 100 of which pay a normal 10K and one of which pays an outlying 2000K, the average salary computed with the outlier is about 29.7K while the average without it is 10K, nearly a three-fold difference.
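A quick sanity check of that arithmetic (a toy sketch, not part of the original analysis):

import numpy as np

salaries = [10] * 100 + [2000]   # 100 normal 10K jobs plus one 2000K outlier
print(np.mean(salaries))         # about 29.7
print(np.mean(salaries[:100]))   # 10.0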

Therefore, we should pay attention to data quality before analysis, especially when the amount of data is relatively small. This analysis covers roughly 10,000 positions, which is a fairly small dataset, so a good amount of time goes into cleaning it.

Let's start with data cleaning and enter the coding phase.

1.0 Filter python-related jobs

Import common libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pymongo import MongoClient
from pylab import mpl

mpl.rcParams['font.sans-serif'] = ['SimHei']

Read data from MongoDB

host = '192.168.29.132'
mongo_conn = MongoClient(host, port=27017)
db = mongo_conn.get_database('lagou')
mon_data = db.py_positions.find()

# convert the JSON records to a DataFrame
jobs = pd.json_normalize([record for record in mon_data])

Preview the data

jobs.head(4)

Print the row and column information of jobs

jobs.info()

A total of about 19,000 jobs were read, but not all of them are Python related. So the first thing we need to do is filter for Python-related jobs, using the rule that the job title or description contains the string "python".

# keep positions whose name or description contains "python"
py_jobs = jobs[(jobs['pName'].str.lower().str.contains("python")) |
               (jobs['pDetail'].str.lower().str.contains("python"))]

py_jobs.info()

After filtering, 10,705 positions remain, and we continue cleaning this subset.

1.1 Clean outliers based on creation time

Cleaning on the "job creation time" dimension is aimed at keeping out particularly dated postings, such as jobs created in 2000.

import time

def timestamp_to_date(ts):
    ts = ts / 1000
    time_local = time.localtime(ts)
    return time.strftime("%Y-%m", time_local)

py_jobs['createMon'] = py_jobs['createTime'].map(timestamp_to_date)

# count job ids by creation month
py_jobs[['pId', 'createMon']].groupby('createMon').count()

Positions in different months

The timestamp_to_date function converts the job creation time into a creation month, and we then count positions by creation month. The result shows that none of the creation times are particularly outrageous, meaning there are no outliers on this dimension. Even so, I still filter on creation time and keep only October, November, and December, because those are the months in which most of the jobs were created and I only want to focus on recent postings.

py_jobs_mon = py_jobs[py_jobs['createMon'] > '2020-09']

1.2 Clean outliers by salary

The salary cleaning focuses on preventing postings with outrageous pay from distorting the results. Three cases are examined: abnormally high salaries, abnormally low salaries, and salaries with an excessively large span.

First, list all the salary values.

py_jobs_mon[['pId', 'salary']].groupby('salary').count().index.values

Then take the highest salary values and check whether any of them are outliers.

# inspect the postings whose salary falls in the top-most ranges listed above
# (the exact salary strings checked here were lost in the original; '150k-200k' matches the outlier found below)
py_jobs_mon[py_jobs_mon['salary'].isin(['150k-200k'])]

Outliers with high salaries

Sure enough, there is an abnormal posting: an internship for fresh graduates offering 150K-200K, which clearly needs to be removed.

Similarly, we can find abnormal positions with the other characteristics.
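For instance, a minimal sketch of how the low-salary and wide-span checks might look; the parsing and thresholds here are my own illustration, not taken from the original:

def salary_lower(salary):
    # lower bound of a range such as "10k-20k"
    return int(salary.lower().replace('k', '').split('-')[0])

def salary_span(salary):
    nums = [int(x) for x in salary.lower().replace('k', '').split('-')]
    return max(nums) - min(nums) if len(nums) == 2 else 0

# suspiciously low salaries (lower bound of 2K or less)
low_salary_jobs = py_jobs_mon[py_jobs_mon['salary'].map(salary_lower) <= 2]

# suspiciously wide salary ranges (span of 30K or more)
wide_span_jobs = py_jobs_mon[py_jobs_mon['salary'].map(salary_span) >= 30]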

Cleaning outliers by work experience (section 1.3) follows the same approach, so I won't include that code here to keep the article from running too long. In total, 9,715 jobs remain after cleaning on these three attributes.

With the data cleaned, we officially enter the analysis, which is divided into two parts: statistical analysis and text analysis. The former computes statistics on numerical indicators, while the latter analyzes the text itself. We are usually more exposed to the former, which lets us understand the analyzed object from a macro perspective. Text analysis has irreplaceable value of its own, and we focus on it in the next chapter.

2. Statistical analysis

When doing statistical analysis, we need to know not only the object being analyzed but also the audience the results are for. In this analysis the audience is college students, because they are the ones who most want to learn about Python positions. Our analysis should therefore be organized around what they want to see rather than a random heap of data.

2.0 Overall data

Statistical analysis generally proceeds from coarse to fine granularity, and the coarsest-grained number is the count without any filter or dimensional split. In our project that is the total number of jobs, which we saw is 9,715. Comparing it with the number of Java or PHP jobs might yield some conclusions, but the total alone has little practical reference value.

So next we need to do a fine-grained split by dimension.

2.1 Single-dimension analysis

Going from coarse to fine, let's first analyze single dimensions. What is the most important thing a college student would want to know? I think it is the distribution of jobs across cities. For a student, the first decision is which city to work in, and the number of jobs there matters: the more jobs, the better the prospects.

# job counts by city
fig = plt.figure(dpi=85)
py_jobs_final['city'].value_counts(ascending=True).plot.barh()

Number of jobs by city

Beijing has the most jobs, more than double second-place Shanghai. Guangzhou has the fewest, even fewer than Shenzhen.

After deciding which city to develop in, the next thing to consider is what kind of position to take. We all know Python has a wide range of applications, so it is natural to look at the distribution of Python jobs across different categories.

# job counts by first- and second-level category
tmp_df = py_jobs_final.groupby(['p1stCat', 'p2ndCat']).count()[['_id']].sort_values(by='_id')
tmp_df = tmp_df.rename(columns={'_id': 'job_num'})
tmp_df = tmp_df[tmp_df['job_num'] > 10]

tmp_df.plot.barh(figsize=(12, 8), fontsize=12)

The p1stCat and p2ndCat categories are labels that come with the source data, not labels I assigned.

The statistics show that testing is the most common category requiring Python skills, data development comes second, and back-end development is relatively low, which matches our perception.

Here we’re looking at the number of jobs, but you can also look at the average salary.

We now have a general idea of Python jobs by city and job category. Do we need to look at other single dimensions, such as salary and work experience, which readers also care about? In my view, city and job category are enough on their own; the rest have little practical reference value as single dimensions. Salary is necessarily tied to the job category (artificial intelligence salaries are naturally high), and work experience is tied to it as well (when big data was just emerging, the required experience was naturally low). So these two dimensions only become meaningful when examined within a given job category. When doing statistical analysis, we should not pile up data indiscriminately; we should think about the logic behind the data and whether it is valuable to the decision maker.

2.2 Two-dimensional cross analysis

For a student who has settled on a city and knows the distribution of positions there, what data do we need to show next to support a career decision?

A student who wants to go to Beijing would like to know the distribution of the different types of positions there, the salaries, the work-experience requirements, and what kind of companies are hiring; students heading to Shanghai, Shenzhen, or Guangzhou have similar needs. This identifies the dimensions and indicators we need: the dimensions are city and job category, which need to be crossed, and the indicators are the number of positions, average salary, work experience, and company. The first three are straightforward, while the fourth needs a quantitative proxy, and here I choose company size.

With the dimensions settled, we need to prepare the indicators. For example, in our data set the salary column is text such as "15K-20K", which needs to be converted into a numeric value. Taking salary as an example, write a function that turns it into a number.

def get_salary_number(salary):
    salary = salary.lower().replace('k', '')
    salary_lu = salary.split('-')
    lower = int(salary_lu[0])
    if len(salary_lu) == 1:
        return lower
    upper = int(salary_lu[1])
    return (lower + upper) / 2

Similar logic applies to work experience and company size; I've omitted that code to save space.
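Since those two helpers are used in the next snippet, here is a rough sketch of what they might look like; the workYear and cSize string formats assumed in the comments are my guesses, so adapt the parsing to the actual data:

# assumption: workYear strings such as "1-3 years", "3-5 years", "unlimited"
def get_work_year_number(work_year):
    digits = [int(s) for s in work_year.replace('-', ' ').split() if s.isdigit()]
    if not digits:
        return 0                      # "unlimited" / fresh graduates -> 0 years
    return sum(digits) / len(digits)  # midpoint of the range

# assumption: cSize strings such as "150-500 people", "more than 2000 people"
def get_csize_number(csize):
    digits = [int(s) for s in csize.replace('-', ' ').split() if s.isdigit()]
    if not digits:
        return 0
    return sum(digits) / len(digits)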

py_jobs_final['salary_no'] = py_jobs_final['salary'].map(get_salary_number)
py_jobs_final['work_year_no'] = py_jobs_final['workYear'].map(get_work_year_number)
py_jobs_final['csize_no'] = py_jobs_final['cSize'].map(get_csize_number)

With dimensions and metrics ready, how do we present the data? Most charts are two-dimensional, with the dimension on the x-axis and the indicator on the y-axis. Since we want to show indicators at the intersection of two dimensions, it is natural to use a three-dimensional chart, so here we use Axes3D to draw it.

# keep only the top second-level categories under the development/test/operations first-level category
job_arr = ['Test', 'Data Development', 'AI', 'Operations', 'Back-end Development']
py_jobs_2ndcat = py_jobs_final[(py_jobs_final['p1stCat'] == 'Development/Test/Operations') &
                               (py_jobs_final['p2ndCat'].isin(job_arr))]

%matplotlib notebook
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# convert cities to numbers for the y axis
city_map = {'Beijing': 0, 'Shanghai': 1, 'Guangzhou': 2, 'Shenzhen': 3}
# indicator column -> axis label
idx_map = {'pId': 'number of positions',
           'salary_no': 'salary (unit: K)',
           'work_year_no': 'work experience (unit: years)',
           'csize_no': 'company size'}

fig = plt.figure()
for i, col in enumerate(idx_map.keys()):
    if col == 'pId':
        aggfunc = 'count'
    else:
        aggfunc = 'mean'
    jobs_pivot = py_jobs_2ndcat.pivot_table(index='p2ndCat', columns='city',
                                            values=col, aggfunc=aggfunc)

    ax = fig.add_subplot(2, 2, i + 1, projection='3d')
    for c, city in zip(['r', 'g', 'b', 'y'], city_map.keys()):
        ys = [jobs_pivot[city][job_name] for job_name in job_arr]
        cs = [c] * len(job_arr)
        ax.bar(job_arr, ys, zs=city_map[city], zdir='y', color=cs)

    ax.set_ylabel('city')
    ax.set_zlabel(idx_map[col])
    ax.legend(city_map.keys())

plt.show()

First, I select only the top 5 job categories, then compute each indicator in a loop using the DataFrame pivot_table method, which conveniently aggregates the indicators across the two dimensions. Finally, the dimensions and indicators are displayed as 3D bar charts.

Taking Beijing as an example, AI positions have the highest salaries, data development and back-end development are similar, and testing and operations are lower. AI generally requires less work experience than other positions; after all, it is a newer role, which is consistent with our perception. The average size of companies hiring for AI positions is smaller than for other positions, suggesting there are more AI startups, while companies hiring for testing and data development are larger; after all, small companies rarely have dedicated testing, nor do they have as much data.

Note that the absolute values of all indicators except the job count are skewed by our processing logic. However, since the same processing is applied to every category, the indicators remain comparable across categories: the absolute values are not meaningful, but the ordering of the different categories is.

2.3 Multidimensional drill-down

What else would a student want to know after settling on a city and a position category? For example, the salary, experience requirements, and company size for AI positions in Beijing across different industries, or the salary and experience requirements for AI positions in Beijing across companies of different sizes.

This involves crossing three dimensions. In theory we can cross-analyze any number of dimensions, but the more dimensions we add, the narrower the field of view and the more focused the analysis. In that case we tend to fix the values of certain dimensions and analyze the situation across the remaining ones.

Taking Beijing as an example, we looked at the salary distribution for different positions and experience requirements

tmp_df = py_jobs_2ndcat[py_jobs_2ndcat['city'] == 'Beijing']
tmp_df = tmp_df.pivot_table(index='workYear', columns='p2ndCat',
                            values='salary_no', aggfunc='mean').sort_values(by='AI')
tmp_df

To view the data more intuitively, let's draw a two-dimensional scatter plot in which the size of each point represents the salary.

[plt.scatter(job_name, wy, c='darkred', s=tmp_df[job_name][wy]*5) for wy in tmp_df.index.values for job_name in job_arr]

We can compare this data either horizontally or vertically. Horizontally, for the same amount of work experience, the salary level of artificial intelligence is generally higher than that of other positions. Vertically, salaries for AI jobs grow much faster with years of experience than for other jobs (the circles enlarge more quickly than the others).

So what you do matters.

Of course, if this still feels unfocused, you can keep drilling down. For example, if you want to see the salary situation for AI positions in Beijing's e-commerce industry across different company sizes, the processing logic is the same.
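A sketch of that drill-down, assuming the DataFrame carries an industry column (I call it 'industry' here; the actual column name and values in the data may differ):

# fix city = Beijing, category = AI, industry = e-commerce; split by company size
tmp_df = py_jobs_2ndcat[(py_jobs_2ndcat['city'] == 'Beijing') &
                        (py_jobs_2ndcat['p2ndCat'] == 'AI') &
                        (py_jobs_2ndcat['industry'] == 'E-commerce')]

tmp_df.pivot_table(index='cSize', values=['salary_no', 'work_year_no'], aggfunc='mean')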

Next, we continue by showing how to analyze Python jobs with text mining. This part involves some data mining algorithms, but it is written for readers new to them: we won't go into the principles of the algorithms, only how to use them to solve the business problem.

3. Text analysis

3.0 Text preprocessing

The purpose of text preprocessing is the same as the data cleaning introduced in the previous chapter, which is to process the data into what we need. This step mainly includes word segmentation and the removal of stop words.

First, preview the job description text in the py_jobs_final DataFrame.

py_jobs_final[['pId', 'pDetail']].head(2)

The job description lives in the pDetail column, which typically contains the familiar "job responsibilities" and "job requirements" sections. Because pDetail is meant to be rendered on a web page, it contains HTML tags. Fortunately, with our crawler background, BeautifulSoup makes them easy to remove.

from bs4 import BeautifulSoup

# BeautifulSoup removes the HTML tags and keeps only the text, lowercased
py_jobs_final['p_text'] = py_jobs_final['pDetail'].map(
    lambda x: BeautifulSoup(x, 'lxml').get_text().lower())

py_jobs_final[['pId', 'pDetail', 'p_text']].head(2)

After removing the HTML tags, we use the jieba module to segment the job text into words. jieba provides three segmentation modes: full mode, precise mode, and search-engine mode. An example shows the differences.

import jieba

job_req = 'familiar with object-oriented programming, master at least one of java/c++/python/php;'

# full mode
seg_list = jieba.cut(job_req, cut_all=True)

# precise mode
seg_list = jieba.cut(job_req, cut_all=False)

# search-engine mode
seg_list = jieba.cut_for_search(job_req)

Full mode

Precise mode

Search-engine mode

The differences are clear. For this analysis, I use precise mode.

py_jobs_final['p_text_cut'] = py_jobs_final['p_text'].map(lambda x: list(jieba.cut(x, cut_all=False)))

py_jobs_final[['pId', 'p_text', 'p_text_cut']].head()

After segmentation, the result still contains many punctuation marks and meaningless function words that do not help the analysis, so the next step is to remove stop words.

stop_words = [line.strip() for line in open('stop_words.txt', encoding='utf-8').readlines()]

def remove_stop_word(p_text):
    if not p_text:
        return p_text
    new_p_txt = []
    for word in p_text:
        if word not in stop_words:
            new_p_txt.append(word)
    return new_p_txt

py_jobs_final['p_text_clean'] = py_jobs_final['p_text_cut'].map(remove_stop_word)
py_jobs_final[['pId', 'p_text_cut', 'p_text_clean']].head()

After the above three steps, the p_text_clean column is clean and ready for the subsequent analysis.

3.1 FP-growth association analysis

The first text analysis is association mining. The classic example of association analysis is "beer and diapers", and I want to use the same idea to discover which words are strongly associated with each other in different Python positions. The mining uses FP-growth from the mlxtend module; FP-growth mines association rules faster than Apriori.

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# construct the input DataFrame required by fpgrowth
def get_fpgrowth_input_df(dataset):
    te = TransactionEncoder()
    te_ary = te.fit(dataset).transform(dataset)
    return pd.DataFrame(te_ary, columns=te.columns_)

Let's start by mining the AI category.

ai_jobs = py_jobs_final[(py_jobs_final['p1stCat'] == 'Development/Test/Operations') &
                        (py_jobs_final['p2ndCat'] == 'AI')]

ai_fpg_in_df = get_fpgrowth_input_df(ai_jobs['p_text_clean'].values)
ai_fpg_df = fpgrowth(ai_fpg_in_df, min_support=0.6, use_colnames=True)

The min_support parameter sets the minimum support: only frequent itemsets whose frequency exceeds this value are kept. For example, among 100 shopping orders, if 70 contain "beer", 75 contain "diapers", and 1 contains "apple", then with min_support=0.6 "beer" and "diapers" are kept while "apple" is discarded, because 1/100 < 0.6.
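The beer-and-diapers example can be reproduced on a tiny synthetic dataset to see how min_support filters itemsets (a toy sketch of my own, reusing the helper defined above):

toy_orders = ([['beer', 'diapers']] * 70 +   # 70 orders with both
              [['diapers']] * 5 +            # diapers alone, 75 in total
              [['apple']] +                  # 1 order with apple
              [['milk']] * 24)               # filler orders, 100 in total

toy_df = get_fpgrowth_input_df(toy_orders)
print(fpgrowth(toy_df, min_support=0.6, use_colnames=True))
# only "beer" (0.70), "diapers" (0.75) and {"beer", "diapers"} (0.70) survive the threshold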

Look at the result in ai_fpg_df.

The itemsets column holds the frequent itemsets, of type frozenset, each containing one or more elements. support is the frequency of the itemset, and all values are greater than 0.6. Row 0 (python) means the word python appears in 99.6% of the postings, and row 16 means python and "algorithm" co-occur in 93.8% of them.

With these we can compute conditional probabilities. For example, I want to know what fraction of python postings also require c++; by the conditional probability formula, p(c++ | python) = p(c++, python) / p(python), which is computed below.

# probability of python
p_python = ai_fpg_df[ai_fpg_df['itemsets'] == frozenset(['python'])]['support'].values[0]

# probability of python and c++ together
p_python_cpp = ai_fpg_df[ai_fpg_df['itemsets'] == frozenset(['python', 'c++'])]['support'].values[0]

# conditional probability of c++ given python
print('p(c++|python) = %f' % (p_python_cpp / p_python))

The result is 64%: among AI jobs that require python, 64% also require c++. We can also look at python's association with other words.
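The per-word comparison behind the next result can be written as a small loop; the words listed here are just stand-ins for the tokens that actually appear in the data:

def p_given_python(word):
    # p(word | python) = p(word, python) / p(python)
    pair = ai_fpg_df[ai_fpg_df['itemsets'] == frozenset([word, 'python'])]['support']
    if pair.empty:
        return None   # the pair is not a frequent itemset at this min_support
    return pair.values[0] / p_python

for w in ['algorithm', 'machine learning', 'deep learning', 'experience']:
    print(w, p_given_python(w))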

Python co-occurs with "algorithm" 94% of the time, which is expected given that we filtered for AI jobs. For python positions, the probabilities of "deep learning" (69%) and "machine learning" (70%) are roughly the same, with machine learning slightly higher, so there seems to be little difference between the two directions. And "experience" appears 85% of the time, so the experience requirement seems fairly rigid.

Similarly, let's look at the association analysis for data development jobs.
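The code mirrors the AI case; a sketch, assuming the second-level category label for data development matches the one used above:

data_jobs = py_jobs_final[(py_jobs_final['p1stCat'] == 'Development/Test/Operations') &
                          (py_jobs_final['p2ndCat'] == 'Data Development')]

data_fpg_in_df = get_fpgrowth_input_df(data_jobs['p_text_clean'].values)
data_fpg_df = fpgrowth(data_fpg_in_df, min_support=0.6, use_colnames=True)
data_fpg_df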

One obvious difference: the words strongly associated with python in the AI category are more technical (machine learning, deep learning, c++), while the words in data development are clearly more business-oriented, such as "business" and "analysis", each of which co-occurs with python more than 60% of the time, since data work is closely tied to the business.

Association rules work at word granularity, which is a bit too fine. Next, we increase the granularity and analyze whole documents.

3.2 Topic model analysis

LDA (Latent Dirichlet Allocation) is a generative model of document topics. It assumes that a document's topic mixture follows a Dirichlet distribution and that the words within each topic also follow a Dirichlet distribution, and it uses optimization algorithms to infer these two latent distributions.

Here we call the LDA implementation in sklearn to do this.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def run_lda(corpus, k):
    cntvec = CountVectorizer(min_df=1, token_pattern='\w+')
    cnttf = cntvec.fit_transform(corpus)
    
    lda = LatentDirichletAllocation(n_components=k)
    docres = lda.fit_transform(cnttf)
    
    return cntvec, cnttf, docres, lda

Here we use CountVectorizer to count word frequencies and produce the word vectors that serve as the LDA input. You could also generate word vectors with deep learning, which has the benefit of capturing relationships between words.
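One step the article skips is building the p_corp column used below: CountVectorizer expects whole strings rather than token lists, so presumably the cleaned tokens are joined back together, something like this sketch:

# join the cleaned token lists into space-separated strings for CountVectorizer
py_jobs_final['p_corp'] = py_jobs_final['p_text_clean'].map(' '.join)

# re-slice ai_jobs so it picks up the new column
ai_jobs = py_jobs_final[(py_jobs_final['p1stCat'] == 'Development/Test/Operations') &
                        (py_jobs_final['p2ndCat'] == 'AI')]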

The LDA here takes only one parameter, n_components: the number of topics to divide the jobs into.

Let's start by dividing the AI jobs into eight topics.

cntvec, cnttf, docres, lda = run_lda(ai_jobs['p_corp'].values, 8)

Calling lda.components_ returns a two-dimensional array in which each row represents a topic and the values in that row represent the topic's word distribution. We define a function that prints the words with the highest probability for each topic.

def get_topic_word(topics, words, topK=10):
    res = []
    for topic in topics:
        # sort in descending order and take the topK word indices
        sorted_arr = np.argsort(topic)[::-1][:topK]
        res.append(', '.join(['%s: %.2f' % (words[i], topic[i]) for i in sorted_arr]))
    return '\n\n'.join(res)

Output the topic-word distributions and top words for the AI jobs.

print(get_topic_word(lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis], cntvec.get_feature_names(), 20))

Dividing lda.components_ by lda.components_.sum(axis=1)[:, np.newaxis] normalizes each topic's word weights into a probability distribution.

We can see that the first topic relates to natural language processing, the second to speech, the third to quantitative finance and investment, the fourth to healthcare, the fifth to machine learning algorithms, the sixth consists of English-language postings, the seventh is computer vision, and the eighth is simulation and robotics.

The split feels reasonable: at least the major directions are separated, and the topics are clearly distinguished from one another.

Similarly, let’s look at the topics for data development positions, which are divided into six topics
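A sketch of the corresponding call; it assumes the p_corp column built earlier and the same translated category labels:

data_jobs = py_jobs_final[(py_jobs_final['p1stCat'] == 'Development/Test/Operations') &
                          (py_jobs_final['p2ndCat'] == 'Data Development')]

cntvec_d, cnttf_d, docres_d, lda_d = run_lda(data_jobs['p_corp'].values, 6)
print(get_topic_word(lda_d.components_ / lda_d.components_.sum(axis=1)[:, np.newaxis],
                     cntvec_d.get_feature_names(), 20))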

The first topic relates to data warehousing and big data technology, the second to English-language postings, the third to databases and the cloud, the fourth to algorithms, the fifth to business and analysis, and the sixth to crawlers. This split also looks reasonable.

Here I focused on the artificial intelligence and data development positions that interest me; testing and back-end development, which we looked at earlier, can be analyzed with the same approach.

That concludes the text analysis. We can see that text analysis surfaces information that statistical analysis misses, and we will use it often in later analyses. Due to time constraints, the word cloud part was not finished; it is not very complicated, and you can try drawing word clouds for different job categories with TF-IDF. The complete code is still being organized; leave me a message if you need it.
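If you want to try the word cloud yourself, a rough sketch with sklearn's TfidfVectorizer and the wordcloud package might look like the following; the wordcloud package and the font path are assumptions of mine, and a CJK font is required if the tokens are Chinese:

from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud

def draw_wordcloud(corpus):
    tfidf = TfidfVectorizer(token_pattern=r'\w+')
    weights = tfidf.fit_transform(corpus)
    # average TF-IDF weight of each word across the category's postings
    word_weights = dict(zip(tfidf.get_feature_names(), weights.mean(axis=0).tolist()[0]))
    wc = WordCloud(font_path='SimHei.ttf', background_color='white')
    plt.imshow(wc.generate_from_frequencies(word_weights))
    plt.axis('off')
    plt.show()

draw_wordcloud(ai_jobs['p_corp'].values)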

Welcome to follow my public account "Duomo".