Hi, everyone. I’m Dapeng, the co-founder of Urban Data Cluster, dedicated to the application and teaching of Python data analysis and data visualization.


While talking with many students, I found that one big difficulty in self-learning Python data analysis is that the materials are too numerous and too complicated. Most materials on the web start from Python syntax and mix in a lot of software development knowledge; they take a great deal of time, yet it is never clear what is actually useful. People expect to be able to write a crawler and draw a map, but instead spend week after week on the basics, so many learners who set out to study Python give up before really getting started.


So I put together the following practical notes to help you clear your mind and learn efficiently. There are three main parts: the syntax you need for Python data analysis, how to build a crawler, and how to do the analysis itself.


1. Two basic Python concepts you must know

A. Variables and assignments

Python can define variable names and assign values directly. For example, when we write a = 4, the Python interpreter does two things:


  • An integer object with the value 4 is created in memory
  • A variable named a is created in memory and pointed at that 4
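Those two steps can be watched directly in code (a short sketch; the variable names are just for illustration):

```python
a = 4        # step 1: the integer object 4 is created; step 2: the name a points at it
b = a        # b points at the same object, not a copy of it
a = 5        # rebinding a to a new object does not affect b
print(a, b)  # 5 4
```

Rebinding a name never changes the object another name points to, which is exactly what the diagram below illustrates.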


Use a diagram to illustrate the main points of Python variables and assignments:


For example, in the code below, “=” is used to assign a value, and Python automatically recognizes the data type:

```python
a = 4        # Python recognizes these as integers
b = 2
c = '4'      # the quotation marks make these strings
d = '2'
print(a + b)
print(c + d)

>>> 6        # integer addition
>>> 42       # string concatenation
```

Please read the code and the comments in the code block. You will find that Python is extremely readable and easy to understand.

B. Data types

There are three types of data that are common in elementary data analysis:

  • list (built into Python)
  • dict, the dictionary (built into Python)
  • DataFrame, a tabular data type provided by the pandas package

They look like this:


List:

```python
# list
liebiao = [1, 2.223, -3, 'Liu Qiangdong', 'Zhang Zetian', 'Jay Chou', 'He Ling', ['weibo', 'Bilibili', 'Douyin']]
```

A list is an ordered collection of elements. The elements can be of any of the data formats and types mentioned earlier (integer, float, string, even another list), and you can add elements at any position at any time, in this form:

```python
# A list is a mutable ordered sequence, so we can append an element to the end:
liebiao.append('thin')
print(liebiao)
# result 1
>>> [1, 2.223, -3, 'Liu Qiangdong', 'Zhang Zetian', 'Jay Chou', 'He Ling', ['weibo', 'Bilibili', 'Douyin'], 'thin']

# Insert an element at a specified position, e.g. 'fat' at index 5:
liebiao.insert(5, 'fat')
print(liebiao)
# result 2
>>> [1, 2.223, -3, 'Liu Qiangdong', 'Zhang Zetian', 'fat', 'Jay Chou', 'He Ling', ['weibo', 'Bilibili', 'Douyin'], 'thin']
```


Dict:

```python
# dictionary
zidian = {'Liu Qiangdong': '45', 'Zhang Zetian': '36', 'Jay Chou': '40', 'He Ling': '26'}
```

Dictionaries use key-value storage, and lookup by key is extremely fast. For example, to quickly find Jay Chou's age, you can write:

```python
zidian['Jay Chou']
>>> '40'
```


In older Python versions, the order in which a dict stored its entries had no relation to the order in which the keys were inserted; that is, "Zhang Zetian" was not necessarily stored right after "Liu Qiangdong". (Since Python 3.7, dicts do preserve insertion order.)
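A short sketch of key lookups (the names and ages here are illustrative, matching the earlier example; the missing key is made up to show the fallback):

```python
zidian = {'Liu Qiangdong': '45', 'Zhang Zetian': '36', 'Jay Chou': '40', 'He Ling': '26'}
print(zidian['Jay Chou'])                 # direct lookup by key
print(zidian.get('Wang Sicong', 'n/a'))   # .get() returns a default instead of raising KeyError
```

Whatever the storage order, lookup by key costs the same, which is why dicts are the go-to structure for fast lookups.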


DataFrame:

A DataFrame is, simply put, a table like a worksheet in Excel. After importing the pandas package, a dictionary or a list can be converted into a DataFrame.

```python
import pandas as pd

df = pd.DataFrame.from_dict(zidian, orient='index', columns=['age'])  # convert the dict into a one-column table
df = df.reset_index().rename(columns={'index': 'name'})  # turn the index (the names) into a regular column
```

As in Excel, any column or row of a DataFrame can be singled out for analysis.
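A minimal sketch of such selection, reusing the dictionary from above (the data itself is illustrative):

```python
import pandas as pd

zidian = {'Liu Qiangdong': '45', 'Zhang Zetian': '36', 'Jay Chou': '40', 'He Ling': '26'}
df = pd.DataFrame.from_dict(zidian, orient='index', columns=['age'])
df = df.reset_index().rename(columns={'index': 'name'})

ages = df['age']       # a single column, returned as a Series
row0 = df.loc[0]       # a single row, selected by index label
print(row0['name'])
```

Column selection with `df['age']` and row selection with `df.loc[...]` are the two moves nearly every analysis step below builds on.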


These three data types are the ones most commonly used in Python data analysis. That's the end of the basic syntax; now it's time to start writing functions that work on the data.


2. The loop function behind Python crawlers


With these basic syntax concepts in hand, we are ready to learn some interesting functions. Let's take the URL traversal that no crawler can avoid as an example and talk about the for loop, which beginners often find the hardest part to understand:

A. The for loop

for is Python's everyday loop construct. Let's start with a simple piece of code to understand what it does:

```python
zidian = {'Liu Qiangdong': '45', 'Zhang Zetian': '36', 'Jay Chou': '40', 'He Ling': '26'}
for key in zidian:
    print(key)

>>> Liu Qiangdong
    Zhang Zetian
    Jay Chou
    He Ling
```

Because a dict is not stored in list order, the order of the iterated results may differ from run to run (on Python 3.7 and later, it follows insertion order). By default, iterating over a dict yields its keys. To iterate over values instead, use for value in d.values(); to iterate over keys and values together, use for k, v in d.items().


As you can see, the names in the dictionary are printed out. This is what the for loop does: it walks through data. Mastering it is the real first step into Python programming.
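The keys-and-values variant mentioned above looks like this (same illustrative dictionary):

```python
zidian = {'Liu Qiangdong': '45', 'Zhang Zetian': '36', 'Jay Chou': '40', 'He Ling': '26'}
for name, age in zidian.items():   # items() yields (key, value) pairs
    print(name, 'is', age)
```

Unpacking each `(key, value)` pair into two loop variables is the idiomatic way to walk a dict when you need both halves.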


B. Crawlers and loops

The for loop comes up constantly when writing Python crawlers, because a crawler usually has to visit every page to gather its information, so building complete and correct page URLs is critical. Take a box office data site as an example; its pages look like this:


The address of the site's weekly JSON data can be found with a packet-capture tool: http://www.cbooo.cn/BoxOffice/getWeekInfoData?sdate=20190114


Look carefully: the box office URLs for different dates differ only in the trailing date, and visiting a different URL shows the box office data for a different date:

What we want to do is walk through the URL for each date and crawl the data down with Python code. This is where the for loop comes in: we can use it to quickly generate every URL that fits the pattern:
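The idea can first be sketched in plain Python with only the standard library (the five Monday dates simply extend the pattern of the example URL above):

```python
from datetime import date, timedelta

base = 'http://www.cbooo.cn/BoxOffice/getWeekInfoData?sdate='
start = date(2019, 1, 14)  # the Monday from the example URL
# walk over 5 consecutive weeks and build one URL per week
urls = [base + (start + timedelta(weeks=i)).strftime('%Y%m%d') for i in range(5)]
for url in urls:
    print(url)
```

The pandas version below does the same thing, but keeps the dates in a DataFrame column so they can be reused later.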


```python
import pandas as pd

url_df = pd.DataFrame({
    'urls': ['http://www.cbooo.cn/BoxOffice/getWeekInfoData?sdate=' for i in range(5)],
    'date': pd.date_range('20190114', freq='W-MON', periods=5)
})
# This reuses the data types from Part 1: range() drives a list comprehension,
# 'urls': [...] is a dict entry, and the dict is converted into a DataFrame.
url_df['urls'] = url_df['urls'] + url_df['date'].dt.strftime('%Y%m%d')  # append each date to the base URL
```


To facilitate understanding, I have drawn a schematic diagram of the traversal process of the for function:

The subsequent crawling steps are omitted here; the related crawler code appears at the end of this article. With the crawler we retrieved 5,800+ records containing 20 fields, covering weekly box office, cumulative box office, audience counts, average attendance, average ticket price, week-on-week changes and more, from January 2008 to February 2019.


3. How does Python implement data analysis?


Besides crawlers, analyzing data is another important use of Python. Can Python do what Excel does? Can it do what Excel can't? Using the movie box office data, let's take one example of each:


A. Python analysis

After collecting and importing the data, selecting fields for a preliminary analysis is an unavoidable step. The DataFrame format makes it easy.


For example, to see which movie topped the chart each week, you can use pandas to filter out each week's No. 1 movie and keep one record per film.

```python
import pandas as pd

data = pd.read_csv('20071-20192.csv', engine='python')
data = data[data['average attendance'] > 20]  # filter out weeks with abnormally low attendance figures
dataTop1_week = data[data['ranking'] == 1][['movie name', 'weekly box office']]
# one record per movie: if a film tops the chart for several weeks, keep its biggest weekly box office
dataTop1_week = dataTop1_week.groupby('movie name').max()['weekly box office'].reset_index()
dataTop1_week = dataTop1_week.sort_values(by='weekly box office', ascending=False)
dataTop1_week.index = dataTop1_week['movie name']
del dataTop1_week['movie name']  # the name is now the index, so drop the original column
dataTop1_week
```


In 9 lines of code, we've done what takes pivot tables, dragging and sorting with the mouse in Excel. Finally, Python's visualization package matplotlib quickly draws the chart:

\

\

B. Analysis with custom functions

That was a simple statistical analysis. Here's something basic Excel can't do: extending the analysis with custom functions. Looking at the data, both the weekly box office and the total box office are recorded, so can the code that just ranked weekly box office be reused to analyze total box office?


Create a custom function using def and the code you just wrote, and specify the rules for the function:

```python
import matplotlib.pyplot as plt

def pypic(pf):
    # pass in a field name pf and draw that field's Top 20 bar chart
    dataTop1_sum = data[['movie name', pf]]
    dataTop1_sum = dataTop1_sum.groupby('movie name').max()[pf].reset_index()  # keep each movie's maximum
    dataTop1_sum = dataTop1_sum.sort_values(by=pf, ascending=False)
    dataTop1_sum.index = dataTop1_sum['movie name']
    del dataTop1_sum['movie name']
    dataTop1_sum[:20].iloc[::-1].plot.barh(figsize=(6, 10), color='orange')
    name = pf + ' top20'
    plt.title(name)  # the chart title is built from the function argument
```

Once the function is defined, turning out the charts in batches is easy:
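A minimal, self-contained sketch of that batching idea (the tiny sample table and its values are made up, and plotting is left out so the data pipeline stands alone):

```python
import pandas as pd

# hypothetical stand-in for the crawled box office table
data = pd.DataFrame({
    'movie name': ['A', 'A', 'B', 'C'],
    'weekly box office': [10, 30, 25, 5],
    'total box office': [40, 40, 25, 5],
})

def top_movies(pf):
    # the same groupby / max / sort pipeline as pypic, minus the bar chart
    top = data[['movie name', pf]].groupby('movie name').max()[pf]
    return top.sort_values(ascending=False)

for field in ['weekly box office', 'total box office']:  # batch over fields
    print(top_movies(field))
```

One loop over field names produces one ranking (or one chart) per field, which is exactly the reuse the question above asked for.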


By learning how to build functions, a data analyst can truly leave the mouse click mode of Excel and step into the field of efficient analysis.


Watching without practicing will get you nowhere

If you only have an hour to learn, the above are the Python essentials you must know. If you are interested in Python data analysis but feel lost along the way, you are welcome to join my free live classes on NetEase Cloud Classroom. Each evening we study and practice one topic, so you can get a quick start in Python data analysis:


Non-stop live sessions to take you from beginner to advanced

Scan the code to reserve a free live broadcast

Tuesday evening, May 7

Quick Start: The Top 10 Mistakes People Make When Learning Python

1. Learn basic Python syntax

2. What are the top ten mistakes?

3. Implement your first data crawler in Python


Wednesday evening, May 8

2. Farewell to Overtime: Replacing Excel with pandas

1. How to use Python to process data quickly?

2. The most common mistakes beginners make when using pandas

3


Thursday evening, May 9

“Poor and Rich Are 1% Apart in Effort: Using Random Numbers to Simulate Social Wealth Distribution”

1. What is the Monte Carlo idea?

2. Premise of random number simulation: accurately judge data distribution

3. Model construction to simulate social wealth distribution


Monday evening, May 13

Python Data Visualization tools: Pyecharts!

1. Why do we need interactive charts for data presentation?

2. Pyecharts basic operations

3. Detailed interpretation of data visualization skills


Tuesday evening, May 14

A 1-Hour Introduction to Python Crawlers: Be a Data Analyst Who Crawls Their Own Data!

1. Understanding web page structure

2. Page parsing and label extraction

3. Implement the first data crawler


Wednesday evening, May 15

Using Data as a Guide: Finding the Most Interesting Places in a City

1. Data crawler construction

2. Field filtering and data cleaning

3. Screening mechanism and evaluation method

4. Visualized expression results of spatial data


Thursday evening, May 16

Population Data: The Turnover of Workers in Shanghai in the past Year

1. National migration data collection

2. Data collation and core city screening

3. Data expression: OD chart making methods and skills


Bonus free lessons

Course outline

1. Getting started with Python

2. Monthly net income model construction

3. Monthly expenditure model construction

4. Simulating repayment scenarios under different conditions

5. Debt accumulation problem

6. How do charts tell a good story? (eggs)


6 GB of introductory materials

The box office crawling and analysis code shown in this article, together with the Chinese box office source data, has been placed in the 6 GB Baidu Netdisk resource pack.


How to claim the benefits

To get all of the benefits above, scan the QR code and add the NetEase Cloud Classroom assistant on WeChat.

Wechat id: Neteasepython

Seats are limited; first come, first served.