What does a complete Python data analysis process look like?

Use Python to grab data from the site, save the data to an SQLite database, clean the data, and finally perform data visualization analysis.

For anyone familiar with Python, crawling the data is the easy part; analyzing it is harder. Writing SQL statements and fiddling with Pandas and Matplotlib is tedious.

So I came up with an easier way to do the analysis: Python crawling + BI analysis. I don't need to explain what BI is. Combine Python's powerful data acquisition with the simple, fast visualization of an agile BI tool, and the results are bound to be good!

So let's take a look at the truth behind Zhihu's famous "everyone is a 985 graduate with a million-yuan salary" meme. Enough talk, start crawling!

1. What data do we want?

The schools and companies of Zhihu users come first, of course. I want to see whether these people are making it up or whether it's real.

Gender, occupation, geographic location, activity level, etc.

2. The crawling process

Zhihu now uses HTTPS and encrypts request data, but that is not a big problem. What matters is that the page data keeps changing and the backend runs anti-crawler checks on each request, so every request needs to carry request headers that look as much like a browser's as possible.
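For illustration, here is a minimal sketch of such a browser-like request; the User-Agent string and authorization value are placeholders you would replace with what your own browser sends, and the include list in the URL is trimmed:

```python
import requests

# Placeholders only: copy the real User-Agent and authorization values
# from your own browser's request to the Zhihu API.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/80.0 Safari/537.36'),
    'authorization': 'oauth <your-token-here>',
}

url = ('https://www.zhihu.com/api/v4/members/excited-vczh/followees'
       '?include=data%5B*%5D.answer_count%2Cgender%2Cfollower_count'
       '&offset=0&limit=20')
print(requests.get(url, headers=headers).status_code)
```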

Once you have the source for the list page, you can get a link for each question:

There are 20 questions on each page, so each list page yields 20 question links; each question is then processed in turn:
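As a rough sketch of that step, assuming the list-page anchors contain `/question/<id>` hrefs (Zhihu's actual markup changes often, so treat the selector as an assumption):

```python
import re

import requests
from bs4 import BeautifulSoup

def get_question_links(list_page_url, headers):
    """Pull question links out of one list page (markup is assumed)."""
    html = requests.get(list_page_url, headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')
    links = set()
    for a in soup.find_all('a', href=True):
        match = re.search(r'/question/\d+', a['href'])
        if match:
            links.add('https://www.zhihu.com' + match.group())
    return list(links)  # roughly 20 question links per list page
```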

To get to this point, all that remains is the loop, the judgment, and the details.

The code for the final section is as follows:

```python
import time

import requests
import pandas as pd

headers = {'authorization': ''}  # fill in your own authorization value here

user_data = []

def get_user_data(page):
    for i in range(page):  # turn pages
        url = ('https://www.zhihu.com/api/v4/members/excited-vczh/followees'
               '?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender'
               '%2Cfollower_count%2Cis_followed%2Cis_following'
               '%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'
               '&offset={}&limit=20').format(i * 20)
        response = requests.get(url, headers=headers).json()['data']
        user_data.extend(response)            # add this page's records to user_data
        print('crawled page %s' % str(i + 1))
        time.sleep(1)                         # pause briefly between requests

if __name__ == '__main__':
    get_user_data(10)
    df = pd.DataFrame.from_dict(user_data)
    df.to_csv('zhihu.csv', encoding='utf_8_sig')  # save to a CSV file named zhihu.csv
```

More source code at the end of the article!

I didn't use a thread pool in my Python code; instead I launched the **main()** method 10 times, i.e. 10 processes, which crawled 570,000+ records in about 4 hours.
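If you'd rather not launch the script by hand ten times, here is a hedged sketch of the same split done with a process pool; the trimmed include list and the 100-pages-per-worker split are assumptions, not what I actually ran:

```python
from multiprocessing import Pool

import pandas as pd
import requests

URL = ('https://www.zhihu.com/api/v4/members/excited-vczh/followees'
       '?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender'
       '%2Cfollower_count&offset={}&limit=20')
HEADERS = {'authorization': ''}  # fill in your own authorization value

def crawl_pages(page_range):
    """Crawl one contiguous range of pages and return the raw records."""
    records = []
    for i in page_range:
        resp = requests.get(URL.format(i * 20), headers=HEADERS)
        records.extend(resp.json()['data'])
    return records

if __name__ == '__main__':
    # 10 worker processes, each responsible for its own slice of pages.
    chunks = [range(w * 100, (w + 1) * 100) for w in range(10)]
    with Pool(processes=10) as pool:
        results = pool.map(crawl_pages, chunks)
    all_records = [rec for chunk in results for rec in chunk]
    pd.DataFrame(all_records).to_csv('zhihu.csv', encoding='utf_8_sig')
```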

3. Use BI for data visualization analysis

Now that we are in the final stage of data visualization with BI, the moment of revelation is approaching.

There are many BI tools on the market. Tableau from abroad and FineBI at home are the leaders in the field, but Tableau is said to suit experienced data analysts and to be quite unfriendly to beginners. I also happened to see an IDC report the other day showing that FanRuan (the company behind FineBI) ranks first in market share. To save myself rework, I chose the agile tool FineBI, and the facts proved my choice was right.

First, download FineBI from the official website. Although FineBI is an enterprise-level data analysis platform, it is permanently free for personal use. A download link is provided at the end of this article ~

Use FineBI's data configuration feature to add an SQL data set (or add the table directly) and verify that the data you just crawled really made it into MySQL.
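If you go the MySQL route, a minimal sketch of loading the crawled CSV into a table FineBI can read might look like this; the connection string and table name are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string: replace user, password, host and database
# with your own MySQL details (requires the pymysql driver).
engine = create_engine('mysql+pymysql://user:password@localhost:3306/zhihu?charset=utf8mb4')

df = pd.read_csv('zhihu.csv')
df.to_sql('zhihu_users', con=engine, if_exists='replace', index=False)
```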

I forgot to mention that one of FineBI's hallmarks is self-service analysis. What is self-service analysis? It means I just drag and drop fields myself and get the same effect as Matplotlib. You might also think of Excel, but Excel struggles once the data goes beyond tens of thousands of rows, while FineBI still handles large data smoothly, dozens or even hundreds of times more efficiently.
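For comparison, here is roughly what the same kind of count would take in code; it assumes the crawled file has the `gender` column from the API response (in Zhihu's API the codes 1/0/-1 roughly map to male/female/unknown):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Count followees by gender code and draw a simple bar chart.
df = pd.read_csv('zhihu.csv')
counts = df['gender'].value_counts()
counts.plot(kind='bar')
plt.title('Zhihu followee gender distribution')
plt.xlabel('gender code')
plt.ylabel('users')
plt.tight_layout()
plt.show()
```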

Meanwhile, VBA's Achilles heel is that it can only automate things inside Excel, nothing else.

Before writing this article, I analyzed housing prices and sales, and made it into a GIF for your reference:

I also have a small learning and exchange group. It is not big, but people learning Python gather there to study together and discuss whatever problems they run into. QQ group: 745895701

4. Data visualization of Zhihu

FineBI's dashboard lets you drag and drop components to adjust their position, and with the various bar charts, pie charts and radar charts, data visualization becomes easier than you would imagine.

1. Which city has the most Zhihu users?

As you can see from the word cloud, the more prosperous the city, the more Zhihu users it has (the larger the text, the larger the share). The four first-tier cities of Beijing, Shanghai, Guangzhou and Shenzhen sit in the center, followed by the new first-tier cities. In other words, most Zhihu users are in first-tier or new first-tier cities.
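In FineBI this is a drag-and-drop word cloud; if you wanted the same thing in plain Python, a rough sketch could look like the following (the `city` column and the font path are assumptions, since location fields need extra include parameters in the API):

```python
from collections import Counter

import pandas as pd
from wordcloud import WordCloud

# Build city frequencies from a hypothetical 'city' column and render
# a word cloud; a CJK-capable font file is needed for Chinese city names.
df = pd.read_csv('zhihu.csv')
freq = Counter(df['city'].dropna())
wc = WordCloud(font_path='simhei.ttf', width=800, height=400,
               background_color='white')
wc.generate_from_frequencies(freq).to_file('city_wordcloud.png')
```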

Take a look at the rankings:

Hangzhou ranks third. As one of the birthplaces of China's Internet industry, that is no exaggeration; Alibaba and NetEase played a big part. Why do I say that? Wait until you see the occupations.

2. Which schools are they from?

See? The education level here really is very high. Who says "everyone is a 985 graduate" is just bragging?

Unsurprisingly, though, Zhihu's users are concentrated where highly educated people gather, and students have more time to spend on their phones than office workers do.

Now that we have analyzed schools, we must take a look at the proportion of male and female students playing Zhihu in various colleges and universities:

You can guess without my telling you: blue represents boys. Girls are either shopping or studying; the ones with their heads down playing on their phones must be boys, hahaha (even though I am male myself).
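The underlying calculation is just a per-school gender breakdown; in pandas terms (with hypothetical `school` and `gender` column names for the cleaned fields) it would be roughly:

```python
import pandas as pd

# Share of each gender code within each school (rows sum to 1).
df = pd.read_csv('zhihu.csv')
ratio = pd.crosstab(df['school'], df['gender'], normalize='index')
print(ratio.head(10))
```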

Let’s take a look at which universities in each region are heavy users of Zhihu. The darker the color is, the more Zhihu users the school has:

No kidding, "everyone on Zhihu is a 985 graduate" really does seem to hold up. I shed tears of envy. I want to ask these students: how do you manage to play on your phone and study at the same time? If you taught me, my gaokao score might have been a bit closer to Tsinghua's admission line….

3. Occupation breakdown of Zhihu users

After excluding the students, we found that the people of Zhihu were all….

Product manager has been the hottest job in recent years, hasn't it? But have you finished your documents yet? All done? Aren't you dissatisfied with Zhihu's page interactions? Then why aren't you at work?

Besides the usual Internet-company roles, teachers and lawyers also account for a large share of Zhihu users.

We also use a heat map to look at how Zhihu's top four occupations are distributed across regions. The darker the color, the more people hold that occupation there:

Conclusion

I have analyzed all this not to tell you what Zhihu users are really like, but to show that FineBI is indeed a very handy data analysis tool, for individuals and enterprises alike.

Of course, this is just the tip of the FineBI iceberg, and you’ll have to explore more for yourself.