Author: Wei Kai

Student of Udacity’s “Data Analyst” nanodegree programme


Since starting Udacity’s “Data Analyst” nanodegree last July, I have learned a lot, and I plan to start gradually looking for a job. To work as a data analyst, I first need to understand the position. The most direct and reliable way is to learn what companies actually need, which can then guide both my study plan and my resume preparation. In this project I use a web crawler to collect data analyst job postings from Lagou, then explore and analyze them. In other words, I use data analysis to understand “data analysis”.


Data source

All of the data used in this project comes from Lagou.com and was collected with a web crawler. The crawler is GooSeeker (“collection search”), a simple yet powerful web crawler product. Crawlers can be customized and run with mouse clicks and a few simple commands, and I recommend it. I chose Lagou.com as the data source because, compared with other recruitment websites, its job information is very complete and clean, with very little missing. Almost all of the displayed information is also highly standardized, which greatly reduces the early-stage workload of data cleaning and wrangling. (After all, I did this project after work with limited time, so I saved effort wherever I could.) The crawl mainly collected the following fields:

Content                            Field
Job title                          title
Monthly salary                     month_salary
Company name                       company
Industry                           industry
Company size                       scale
Financing stage                    phase
Investors                          investors
City                               city
Experience requirements            experience
Education requirements             qualification
Full-time/part-time                full_or_parttime
Job description and requirements   description


Project purpose

The main purpose of this project, done as part of the Udacity data analyst course, is to answer some questions about the data analyst position using real data. Specifically, it addresses the following questions:


– How is job demand for data analysts distributed across regions?

– How is salary distributed across all postings?

– What do data analysts earn in different cities?

– How much work experience does the position require?

– How does salary change with experience?

– What skills should a data analyst have, from an employer’s perspective?

– Does mastering different skills affect salary, and if so, how?


Techniques and Tools

This project has two main parts. The first part is data crawling, using the GooSeeker (“collection search”) web crawler tool. The second part is data analysis, based on the Python programming language. pandas is used for data processing and statistical analysis, matplotlib for visualization, and the seaborn package for styling the charts. For the skill requirement analysis, jieba is used as the word segmentation toolkit and the wordcloud package is used to draw the word cloud.


Data sorting


Load and clean










After preliminary cleaning, the data set contains 13 valid variables and 575 records. Except for investors, the fields are very complete, with almost no missing values. This is great news for the subsequent analysis.
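The original loading code is not shown here, so the following is only a minimal sketch of how such a load-and-check step might look with pandas. The sample rows are made up, and dropping exact duplicates is an assumed cleaning step, not necessarily what was done in this project.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the crawled file; all values are made up.
raw_csv = io.StringIO(
    "title,month_salary,company,city,investors\n"
    "Data Analyst,10k-20k,Company A,Beijing,\n"
    "Data Analyst,8k-15k,Company B,Shanghai,Fund X\n"
    "Data Analyst,8k-15k,Company B,Shanghai,Fund X\n"
)

df = pd.read_csv(raw_csv)

# Assumed cleaning step: drop postings that are exact duplicates.
df = df.drop_duplicates().reset_index(drop=True)

# Share of missing values per column; in this sample only `investors` has gaps.
missing_share = df.isnull().mean()

print(df.shape)
print(missing_share)
```

Checking the missing share per column is a quick way to confirm which fields, such as investors, are too sparse to rely on.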



Data analysis


Regional distribution








On Lagou, companies in 29 cities across China have posted data analyst positions. Nearly half of the postings are in Beijing, which ranks first nationwide. The top 5 cities are Beijing, Shanghai, Shenzhen, Hangzhou and Guangzhou. Data analyst jobs are largely concentrated in the four first-tier cities of Beijing, Shanghai, Guangzhou and Shenzhen, plus Hangzhou, a hub of Internet and e-commerce companies. I was a little surprised by the sheer size of the demand in Beijing, but the result is reasonable: Lagou is a recruitment platform focused on Internet-related industries, and a large share of China’s Internet companies are clustered in Beijing. When I have time later, I may analyze the geographic distribution of the Internet industry nationwide.


All in all, the conclusion is clear: a large number of data analyst job opportunities are in Beijing, Shanghai, Guangzhou, Shenzhen and Hangzhou. Anyone who wants to develop in this direction should focus on these cities. On the other hand, these cities also attract large numbers of talented people from every industry, so competition is bound to be intense.
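The city ranking above comes down to counting postings per city. A minimal sketch with pandas, using a few made-up rows in place of the real data set, could look like this:

```python
import pandas as pd

# Made-up postings; only the `city` column matters for this step.
df = pd.DataFrame(
    {"city": ["Beijing", "Beijing", "Shanghai", "Shenzhen", "Beijing", "Hangzhou"]}
)

# Number of postings per city, sorted from most to least.
city_counts = df["city"].value_counts()

# Each city's share of all postings.
city_share = city_counts / city_counts.sum()

# Names of the five cities with the most postings.
top_cities = city_counts.head(5).index.tolist()

print(city_counts)
print(top_cities)
```

On the real data, `city_counts` would have 29 entries, and `city_share` would show Beijing at close to one half.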


Overall compensation








Like most other jobs, data analyst pay is right-skewed. Most salaries fall between 5K and 20K per month. Only a few postings offer more, but a very small number offer extremely high salaries, which gives people something to look forward to. Note that salaries on Lagou are given as ranges, and the ranges overlap. To simplify the analysis, I use the midpoint of each range as its representative value. The actual distribution may therefore be a little better than the chart shows, since some people will always get the top of the range. Overall, data analyst salaries are quite good, and from this perspective it is a good career choice.
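Converting a salary range to its midpoint can be sketched as follows. The text format `10k-20k` is an assumption about how the crawled `month_salary` values look, and `salary_midpoint` is a hypothetical helper name, not a function from the original notebook.

```python
import re


def salary_midpoint(salary_range):
    """Return the midpoint, in thousands of yuan, of a range such as '10k-20k'.

    The 'NNk-NNk' format is an assumption about the crawled data.
    """
    # Collect every number that is followed by 'k' or 'K'.
    bounds = [float(x) for x in re.findall(r"(\d+(?:\.\d+)?)\s*[kK]", salary_range)]
    if not bounds:
        return None  # text without a recognizable range is treated as missing
    # With two bounds this is the midpoint; with one bound it is that value.
    return sum(bounds) / len(bounds)


for text in ["10k-20k", "8K-15K", "negotiable"]:
    print(text, salary_midpoint(text))
```

Because a 10k-20k posting is recorded as 15, anyone hired near the top of a range earns more than the chart suggests, which is why the plotted distribution is slightly conservative.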


Salary distribution in different cities








Ignoring cities with little demand, I focused on the top six cities by number of postings. As the chart shows, salaries in these six cities are fairly concentrated, consistent with the nationwide distribution seen above. Shenzhen has the highest median salary, about 15K, followed by Beijing at 12.5K and then Shanghai and Hangzhou. Shenzhen really is a city of miracles, and this result was a small surprise to me. From a pay perspective, Shenzhen is a good place for a data analyst to build a career.
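Restricting the comparison to the busiest cities and then taking the median per city can be sketched like this. The rows and the `salary` column (midpoint in thousands of yuan) are made up for illustration, and `top_n` is 2 here only because the sample is tiny; the analysis above used six.

```python
import pandas as pd

# Made-up sample; `salary` is an assumed column holding range midpoints (K yuan).
df = pd.DataFrame(
    {
        "city": ["Shenzhen", "Shenzhen", "Beijing", "Beijing", "Beijing", "Chengdu"],
        "salary": [15.0, 20.0, 12.5, 10.0, 15.0, 8.0],
    }
)

top_n = 2  # the article keeps the top six cities

# Cities with the most postings.
top_cities = df["city"].value_counts().head(top_n).index

# Keep only postings from those cities, then compute the median salary per city.
subset = df[df["city"].isin(top_cities)]
median_by_city = (
    subset.groupby("city")["salary"].median().sort_values(ascending=False)
)

print(median_by_city)
```

The same `subset` could then be drawn as a box plot, for example with seaborn’s `boxplot(x="city", y="salary", data=subset)`.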



Work experience requirements








Unsurprisingly, the distribution of required work experience is roughly bell-shaped. Candidates with one to three years of experience are most in demand, followed by analysts with three to five years. There is less demand for candidates with under one year of experience. Demand for five to ten years of experience is low, and postings requiring more than ten years are very rare.


From this distribution we can make two rough guesses:


Data analysis is a young career field, and most of the demand is concentrated on candidates with 1-3 years of experience. For data analysts, five years is a bottleneck: without a career transition or a qualitative improvement within five years, competitive pressure will probably increase afterwards.
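Experience levels have a natural order that alphabetical sorting does not respect, so one way to count them is with an ordered categorical. This is a sketch only: the level labels below are assumed English stand-ins for the values in the `experience` field, and the rows are made up.

```python
import pandas as pd

# Assumed labels for the experience levels, listed from least to most experience.
levels = ["<1 year", "1-3 years", "3-5 years", "5-10 years", ">10 years"]

# Made-up postings.
df = pd.DataFrame(
    {
        "experience": [
            "1-3 years", "3-5 years", "1-3 years",
            "<1 year", "5-10 years", "1-3 years",
        ]
    }
)

# An ordered categorical keeps the levels in their logical order.
df["experience"] = pd.Categorical(df["experience"], categories=levels, ordered=True)

# Count postings per level and sort by level, not by count.
exp_counts = df["experience"].value_counts().sort_index()

print(exp_counts)
```

Levels with no postings still appear with a count of zero, which keeps the bar chart’s x-axis complete.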


Salary distribution by experience







It’s no surprise that data analysts are paid more as they gain experience. What’s more, based on the current data, data analysis looks like a career with staying power: earnings probably won’t decline with age, at least over a ten-year horizon.


Job skill keywords











The word cloud differed a little from what I expected. For the data analyst position, the most frequently required skills are not Python and R, the currently fashionable data analysis languages, but the traditional Structured Query Language (SQL) and Excel. SQL and Excel appear to be essential skills for a data analyst position. As the word cloud shows, the most frequently demanded skills are SQL, Excel, SAS, SPSS, Python, Hadoop and MySQL. Java, PPT, BI software and so on form the second tier.


The impact of mastering different skills on salary income










For the 15 most frequently demanded skills, I calculated the average salary of the postings that mention each skill, as shown in the figure above. The size of each dot represents how often the skill is demanded.


Among the top 15 skills, Shell, Hive and Spark have the highest average salaries, well above the other skills. Hive and Spark are both used for distributed data processing, and shell scripting is a required skill on Linux. Together, these three point in one direction: distributed processing of massive data!


So, for anyone who wants a high salary: massive data processing and distributed processing frameworks are the way to get there. It’s also worth noting that Python, which is currently on the rise, is associated with a higher average salary than Java. SQL and the traditional data analysis packages SAS and SPSS offer a mid-level income while meeting the requirements of more companies, which means more job opportunities.
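The per-skill averages can be sketched as follows: for each skill, select the postings whose description mentions it and average their salaries. The rows, the `salary` column (midpoint in thousands of yuan), and the short skill list are all made up for illustration.

```python
import pandas as pd

# Made-up postings; `salary` is an assumed column of range midpoints (K yuan).
df = pd.DataFrame(
    {
        "description": ["SQL Excel", "SQL Python Hive", "Hive Spark SQL", "Excel SPSS"],
        "salary": [10.0, 20.0, 30.0, 8.0],
    }
)

skills = ["SQL", "Excel", "Hive"]  # the article uses the top 15 skills

rows = []
for skill in skills:
    # Postings whose description mentions the skill (case-insensitive substring).
    mask = df["description"].str.contains(skill, case=False, regex=False)
    rows.append(
        {
            "skill": skill,
            "postings": int(mask.sum()),
            "mean_salary": df.loc[mask, "salary"].mean(),
        }
    )

skill_stats = (
    pd.DataFrame(rows).set_index("skill").sort_values("mean_salary", ascending=False)
)

print(skill_stats)
```

Here `postings` would drive the dot size and `mean_salary` the dot position. Plain substring matching is only safe for distinctive names; a short name such as R would need token-level matching instead.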


Analysis conclusion

The analysis above leads to the following conclusions. Data analyst job opportunities are heavily concentrated in Beijing, Shanghai, Guangzhou, Shenzhen and Hangzhou. Most data analysts earn between 5K and 20K per month. Only a few earn more, but a very small number earn very high salaries, which gives people something to look forward to.


From a pay perspective, Shenzhen is a good choice for data analysts, followed by Beijing and Shanghai. Data analysis is a young career field, and most of the demand is for candidates with 1-3 years of experience.


For data analysts, five years seems to be a bottleneck. Without a career transition or a qualitative improvement within five years, competitive pressure will probably increase afterwards. Salary rises with experience, and those with more than 10 years of experience can be paid quite well.


The skills most in demand for data analysts are SQL, Excel, SAS, SPSS, Python, Hadoop and MySQL, of which SQL and Excel are almost mandatory. Massive data processing and distributed processing frameworks are the way to a high salary. SQL and the two major traditional data analysis packages, SAS and SPSS, offer a mid-level income while meeting the requirements of more companies, which means more job opportunities.


Reflections and summary

The analysis of data analyst skills here is relatively simple: it covers only tool skills. In reality, data analysts need far more than that. They also need a solid foundation in mathematics and statistics, a good feel for data, and thinking that is both creative and rigorous. It would be more interesting to dig into these qualities. Doing so, however, requires considerable knowledge of Chinese word segmentation, keyword extraction and related techniques, which makes it more difficult. Limited by time, I will not go further here, and I hope to do a dedicated analysis later. One complaint: the Python 2.x environment does not handle Chinese encoding well, which cost me a lot of time and energy during data processing and led to many mistakes and detours. This deserves attention in future work, or a switch to Python 3.


Special note: all of the data in this analysis comes from Lagou.com, which is a recruitment platform focused on Internet-related industries. The conclusions are therefore most applicable to companies in or around the Internet industry and may not hold for companies in other industries.