While working with many students, I have found that one difficulty in teaching yourself Python data analysis is that there is too much material and it is too complicated. Most material on the web starts with Python syntax and mixes in a lot of knowledge about Python development. It takes a long time to get through, yet it is not clear which parts are actually useful. People set out thinking they will soon be able to write a crawler and draw a chart, but then spend weeks on the basics, and so many who were motivated to learn Python give up before they have even gotten started.

If you want to learn Python efficiently, you need to focus on the practical essentials.

So I have summarized those essentials below to help you organize your thinking and learn more efficiently. There are three main parts: the syntax you need to know for Python data analysis, how to implement a crawler, and how to do data analysis.

1. Two basic Python concepts you must know

A. Variables and assignments

Python lets you define a variable name and assign it a value directly. For example, when we write a = 4, the Python interpreter does two things:

  • It creates an integer value of 4 in memory
  • It creates a variable named a in memory and points it to that 4

Figure: the main points of Python variables and assignment.

For example, in the code below, "=" assigns a value, and Python recognizes the data type automatically:

a = 4  # integer
b = 2  # integer
c = "4"  # string
d = "2"  # string
print("a+b =", a + b)  # integers are added: 4 + 2
print("c+d =", c + d)  # strings are concatenated: "4" + "2"

Running it produces the following result:

a+b = 6
c+d = 42

Read the code and its comments, and you will find that Python is extremely easy to read and understand.
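As a small illustration of the two steps described above, the sketch below (the variable names are only for demonstration) shows that a name points to a value, that type() reports the type Python inferred, and that a string has to be converted before it can be added as a number.

```python
a = 4          # the name a points to the integer 4
b = a          # b now points to the same value
a = 5          # rebinding a does not affect b

c = "4"        # the quotes make this a string, not a number

print(b)                  # b still refers to 4
print(type(a).__name__)   # the type Python inferred for a
print(type(c).__name__)   # the type Python inferred for c
print(int(c) + b)         # convert the string first, then add
```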

B. Data types

Three data types are common in introductory data analysis:

list (built into Python), dict (dictionary, built into Python), and DataFrame (from the pandas package)

They look like this:

List:

# list
liebiao = [1, 2.223, 3, 'Liu Qiangdong', 'Zhang Zetian', 'Jay Chou', 'Kun Ling', ['Weibo', 'Bilibili', 'Douyin']]

A list is an ordered collection whose elements can be of any of the data types mentioned so far (integer, floating point, list…), and you can add elements at any position at any time:

# A list is a mutable ordered collection, so we can append an element to the end:
liebiao.append('thin')
print(liebiao)

# Result 1
[1, 2.223, 3, 'Liu Qiangdong', 'Zhang Zetian', 'Jay Chou', 'Kun Ling', ['Weibo', 'Bilibili', 'Douyin'], 'thin']

We can also insert the element 'fat' at a specified position, such as index 5:

liebiao.insert(5, 'fat')
print(liebiao)

# Result 2
[1, 2.223, 3, 'Liu Qiangdong', 'Zhang Zetian', 'fat', 'Jay Chou', 'Kun Ling', ['Weibo', 'Bilibili', 'Douyin'], 'thin']
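Elements are read back by position. A minimal sketch, using a shortened stand-in list (items is an illustrative name, not part of the example above):

```python
items = [1, 2.223, 'thin', ['Weibo', 'Douyin']]

first = items[0]        # indexing starts at 0
last = items[-1]        # negative indexes count from the end
nested = items[3][1]    # index into the inner list
head = items[:2]        # a slice returns a new list

items.insert(1, 'fat')  # everything from index 1 onward shifts right
print(first, last, nested, head)
print(items.index('fat'), len(items))
```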

Dict:

# dictionary: name -> age (the ages are illustrative and stored as strings)
zidian = {'Liu Qiangdong': '46', 'Zhang Zetian': '36', 'Jay Chou': '40', 'Kun Ling': '26'}

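A dictionary stores key-value pairs and offers extremely fast lookup by key. For example, to look up Jay Chou's age quickly, you can write:

zidian['Jay Chou']
'40'

Values in a dict are located by key, not by position. In Python 3.6 and earlier the storage order had no relation to the order in which the keys were added, so "Zhang Zetian" was not guaranteed to come after "Liu Qiangdong". Since Python 3.7 a dict preserves insertion order, but you should still look values up by key rather than rely on position.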
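A short sketch of common dictionary operations, using a made-up ages dict rather than the data above. Looking up a missing key with square brackets raises KeyError, so in and get() are the usual safe alternatives.

```python
ages = {'Ann': '31', 'Bob': '27'}   # hypothetical sample data

print('Ann' in ages)                # membership test on the keys
print(ages.get('Cid', 'unknown'))   # a default instead of a KeyError

ages['Cid'] = '45'                  # add a new key-value pair
ages['Bob'] = '28'                  # overwrite an existing value

# Python 3.7+ keeps keys in insertion order
print(list(ages))
```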

DataFrame:

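A DataFrame can be thought of simply as a table, like a sheet in Excel. After importing the pandas package, a dictionary or a list can be converted to a DataFrame.

import pandas as pd
# Build a DataFrame from the dict: keys become the index, values go into the "age" column
df = pd.DataFrame.from_dict(zidian, orient='index', columns=['age'])
# Turn the index into a regular column and rename it to "name"
df = df.reset_index().rename(columns={'index': 'name'})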

As with Excel, any column or row of a DataFrame can be singled out for analysis.
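The conversion, and the idea of singling out a column or row, can be sketched as follows. The data is hypothetical and pandas is assumed to be installed.

```python
import pandas as pd

ages = {'Ann': 31, 'Bob': 27, 'Cid': 45}   # hypothetical sample data

# Keys become the index, values fill a column named 'age'
df = pd.DataFrame.from_dict(ages, orient='index', columns=['age'])
# Move the index into an ordinary column called 'name'
df = df.reset_index().rename(columns={'index': 'name'})

age_column = df['age']          # one column, returned as a Series
first_row = df.iloc[0]          # one row, selected by position
over_30 = df[df['age'] > 30]    # rows that satisfy a condition

print(df)
print(over_30['name'].tolist())
```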

These three data types are the ones used most often in Python data analysis. That concludes the basic syntax; it is time to start writing code that processes data.

2. Learn Python's loops

With these basic concepts in hand, we are ready to look at something more interesting. Let's take traversing URLs, which no crawler can avoid, as the example and talk about the for loop, which beginners tend to find the hardest to understand:

A. The for loop

The for loop is the most common kind of loop (strictly speaking it is a statement, not a function). Let's start with a simple piece of code to see what it does:

for key in zidian:
    print(key)

Liu Qiangdong
Zhang Zetian
Jay Chou
Kun Ling

As you can see, the names in the dictionary are printed one by one. By default, iterating over a dict yields its keys. To iterate over the values, use for value in d.values(); to iterate over keys and values together, use for k, v in d.items(). In Python 3.6 and earlier a dict was not stored in order the way a list is, so the order of the iterated results could differ from the order of insertion; since Python 3.7 the keys come out in insertion order.

The for loop is used to walk through data. Mastering it is your real introduction to programming in Python.
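The three ways of iterating over a dict can be sketched like this (ages is made-up sample data):

```python
ages = {'Ann': 31, 'Bob': 27, 'Cid': 45}   # hypothetical sample data

for name in ages:                 # keys (the default)
    print(name)

total = 0
for age in ages.values():         # values only
    total += age

older = []
for name, age in ages.items():    # key and value together
    if age > 30:
        older.append(name)

print(total, older)
```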

B. Crawlers and loops

The for loop is used constantly when writing Python crawlers, because a crawler usually has to traverse many web pages to retrieve information, so building complete and correct page links is critical. Take a box office data website as an example. Its pages look like this:

Using a packet capture tool, we can find the address of the site's weekly box office JSON data: www.cbooo.cn/BoxOffice/g…

If you look carefully, the URLs for the box office data of different dates differ only in the date at the end, and visiting a different URL shows the box office data for a different date:

All we need to do is traverse the URL for each date and use Python code to crawl the data. This is where the for loop comes in handy: with it we can quickly generate all the URLs that meet the criteria:

url_df = pd.DataFrame({'urls': ['www.cbooo.cn/BoxOffice/g…' for i in range(5)], 'date': pd.date_range('20190114', freq='W-MON', periods=5)})

The shared part of the URL is generated five times, and the dates of five consecutive Mondays are generated with pandas' time series function date_range.

This uses several of the data types from Part 1:

[… for i in range(5)] builds a list,
{'urls': […], 'date': …} is a dictionary,
pd.DataFrame creates the DataFrame.

url_df['urls'] = url_df['urls'] + url_df['date'].astype('str')

The last line converts each date to a string and appends it to the shared part of the URL.
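A self-contained sketch of the same idea. The base URL below is a placeholder, not the site's real address, and strftime is used here to make the date format explicit instead of relying on astype.

```python
import pandas as pd

BASE_URL = 'https://example.com/boxoffice?date='   # hypothetical placeholder

# Five consecutive Mondays starting from 2019-01-14
dates = pd.date_range('2019-01-14', freq='W-MON', periods=5)

url_df = pd.DataFrame({
    'urls': [BASE_URL for _ in range(5)],   # the same prefix repeated five times
    'date': dates,
})
# Append each date, formatted as YYYY-MM-DD, to its URL prefix
url_df['urls'] = url_df['urls'] + url_df['date'].dt.strftime('%Y-%m-%d')

for url in url_df['urls']:
    print(url)
```

In a real crawler, each of these URLs would then be requested in turn inside the same kind of for loop.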


To make this easier to understand, I have drawn a schematic diagram of how the for loop traverses the data:

The rest of the crawling process is omitted here; the related crawler code is shown at the end of this article. Using the crawler we retrieved more than 5,800 records with 20 fields, covering weekly box office, cumulative box office, admissions, average attendance per screening, average ticket price, period-over-period change in screenings, and more, from January 2008 to February 2019.

3. How does Python implement data analysis?

Besides crawling, analyzing data is one of the most important uses of Python. How does Python do what Excel can do? And can Python do what Excel cannot? Using the box office data, we give one example of each:

A. Python analysis

After collecting and importing the data, selecting fields for a preliminary analysis is a step that every data analysis goes through. The DataFrame data format makes this step easy.

For example, to see which films ranked first at the weekly box office, you can use pandas' filtering to select the rows ranked first in each week and keep only the columns you need.

# Weekly box office champions over time
import pandas as pd
data = pd.read_csv('20071-20192.csv', engine='python')
# After importing the data, keep films with average attendance above 20 as valid data
data = data[data['average attendance'] > 20]
# Select the films ranked first each week and keep the "movie name" and "weekly box office" columns
dataTop1_week = data[data['rank'] == 1][['movie name', 'weekly box office']]
# Group by "movie name"; if a film topped the chart several weeks in a row, keep only its largest weekly box office
dataTop1_week = dataTop1_week.groupby('movie name').max()['weekly box office'].reset_index()
# Sort by "weekly box office" in descending order
dataTop1_week = dataTop1_week.sort_values(by='weekly box office', ascending=False)
# Use the movie name as the index and delete the original "movie name" column
dataTop1_week.index = dataTop1_week['movie name']
del dataTop1_week['movie name']
# View the data
dataTop1_week

With 9 lines of code we reproduced what takes a pivot table, dragging, sorting, and other mouse clicks in Excel. Finally, Python's visualization package Matplotlib can quickly draw the chart:
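To see what each step does without the real file, here is the same kind of pipeline on a tiny made-up table. The column names and figures are invented for illustration and do not come from the crawled data.

```python
import pandas as pd

# Hypothetical weekly records: film B is ranked first in two different weeks
data = pd.DataFrame({
    'movie name': ['A', 'B', 'B', 'C', 'D'],
    'rank': [1, 1, 1, 2, 1],
    'weekly box office': [300, 500, 800, 900, 100],
})

# Keep only the rows ranked first, and only the two columns of interest
top1 = data[data['rank'] == 1][['movie name', 'weekly box office']]
# One row per film, keeping its best week
top1 = top1.groupby('movie name')['weekly box office'].max().reset_index()
# Largest weekly box office first
top1 = top1.sort_values(by='weekly box office', ascending=False)
top1 = top1.set_index('movie name')

print(top1)
```

Here set_index does in one call what assigning the index and then deleting the column does in two. Film C never appears in the result because it was never ranked first.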

B. Analysis with functions

That was a simple statistical analysis. Here is something basic Excel features cannot do: extending the analysis with custom functions. Looking at the data, we can see that it records both the weekly box office and the cumulative (total) box office, so can the code that just ranked the weekly box office be reused for a total box office analysis?

Create a custom function with def from the code you just wrote, and specify the rules for the function:

import matplotlib.pyplot as plt

def pypic(pf):
    # Define a function pypic whose parameter pf is a column name
    # From the source data, select the "movie name" column and the column named pf
    dataTop1_sum = data[['movie name', pf]]
    # Group by "movie name"; if a film stayed on the chart for several weeks, keep its largest pf value
    dataTop1_sum = dataTop1_sum.groupby('movie name').max()[pf].reset_index()
    # Sort by pf in descending order
    dataTop1_sum = dataTop1_sum.sort_values(by=pf, ascending=False)
    # Use the movie name as the index and delete the original column
    dataTop1_sum.index = dataTop1_sum['movie name']
    del dataTop1_sum['movie name']
    # Draw a horizontal bar chart of the top 20, reversed so the largest bar is at the top
    dataTop1_sum[:20].iloc[::-1].plot.barh(figsize=(6, 10), color='orange')
    # Title the chart according to the column name passed in
    name = pf + ' top20 analysis'
    plt.title(name)

Once the function is defined, producing charts in batches is easy:

By learning to build functions, a data analyst can truly leave Excel's point-and-click mode behind and move into efficient analysis.
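As a variant of the idea, the sketch below returns the ranked table instead of drawing it, so the result can be checked without a plotting backend. The function top_n_by, its parameters, and the sample figures are hypothetical and not part of the original code.

```python
import pandas as pd

def top_n_by(data, column, n=20):
    """Return the n films with the largest value in `column`."""
    best = data.groupby('movie name')[column].max()
    return best.sort_values(ascending=False).head(n)

# Hypothetical sample data
data = pd.DataFrame({
    'movie name': ['A', 'B', 'B', 'C'],
    'weekly box office': [300, 500, 800, 900],
    'total box office': [300, 500, 1300, 2000],
})

# The same function serves both columns
for column in ['weekly box office', 'total box office']:
    print(top_n_by(data, column, n=2))
```

Passing data in as a parameter, rather than reading a global variable, also lets the same function run on any table that has a "movie name" column.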

4. You will never get started just by reading

If you only have an hour to learn, these are the Python essentials you must know. But if you only read and never practice, you will always remain a layman.
