Original title | 10 Python Pandas tricks that make your work more efficient

Author | Shiu-Tang Li

Original link | towardsdatascience.com/10-python-p…

Translator | kbsc13 (author of the WeChat official account "Growth of the Algorithm Ape")

Statement | This translation is for learning and exchange purposes only. Reposting is welcome, but please keep this statement and do not use it for commercial or illegal purposes.

Introduction

Pandas is a Python library that is widely used in areas such as data analysis. There are many tutorials on it, but here are some of the less popular but very useful tips.

read_csv

This is a function that everyone should know, because it’s how you read a CSV file.

However, if you need to read a very large dataset, you can add the parameter nrows=5 to load only a small sample first. This lets you verify that the right delimiter is being used (not every file is comma-separated) before loading the entire dataset.
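A minimal sketch of this idea, assuming the file is called data.txt as in the example below:

import pandas as pd

# Peek at the first 5 rows to check the delimiter and the columns
df_sample = pd.read_csv('data.txt', nrows=5)
print(df_sample)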

P.S. On a Linux terminal, you can run the head command to view the first few lines of a file, for example:

head -n 5 data.txt

A few other useful parameters: if you already know which columns you need (df.columns.tolist() on a sample read gives the full list), pass usecols=['c1', 'c2', ...] to read only those columns. You can also pass dtype={'c1': str, 'c2': int, ...} to specify the type of each column. Another advantage of dtype is that, for columns that may contain mixed types such as strings and integers, forcing a single type avoids errors later when that column is used as a key to merge different tables.
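A small sketch combining these parameters (the column names here are placeholders):

# Read only the columns you need, with explicit types
df = pd.read_csv('data.txt', usecols=['c1', 'c2'], dtype={'c1': str, 'c2': int})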

select_dtypes

If you have to do data preprocessing in Python, this method can save you time. After reading a table, the default data types of its columns can be bool, int64, float64, object, category, timedelta64, or datetime64. To see the distribution of data types in the dataframe, first run:

df.dtypes.value_counts()

Then use the following method to select specific types of data, such as numeric features:

df.select_dtypes(include=['float64', 'int64'])

copy

This method is important, so let’s start with the following example:

import pandas as pd

df1 = pd.DataFrame({'a': [0, 0, 0], 'b': [1, 1, 1]})
df2 = df1
df2['a'] = df2['a'] + 1
df1.head()

After running this code, you will find that the values of df1 have changed too. That is because df2 = df1 does not make a copy of df1; it simply makes df2 another reference to the same object, so any change to df2 also changes df1. To fix this, make an explicit copy:

df2 = df1.copy()

or

from copy import deepcopy
df2 = deepcopy(df1)

map

This is a cool command that can be used to do simple data conversion operations. First, we need to define a dictionary whose keys are old values and whose values are new values, as follows:

level_map = {1: 'high', 2: 'medium', 3: 'low'}
df['c_level'] = df['c'].map(level_map)

A few more examples:

  • Converting boolean True/False to 1/0 (see the sketch after this list)
  • Defining levels
  • User-defined dictionary encodings
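
A minimal sketch of the first example, using a hypothetical boolean column named flag:

# 'flag' is a hypothetical boolean column, used only for illustration
df['flag_int'] = df['flag'].map({True: 1, False: 0})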

apply or not apply

The apply method is useful if we want to create a new column that takes other columns as input:

def rule(x, y):
    if x == 'high' and y > 10:
         return 1
    else:
         return 0
df = pd.DataFrame({'c1': ['high', 'high', 'low', 'low'], 'c2': [0, 23, 17, 4]})
df['new'] = df.apply(lambda x: rule(x['c1'], x['c2']), axis=1)
df.head()

In this code, we define a method with two input parameters, and then apply it to the two columns c1 and c2 of df using the apply method.

The problem with apply is that it is sometimes too slow. For example, if you want to compute the row-wise maximum of columns c1 and c2, you can write:

df['maximum'] = df.apply(lambda x: max(x['c1'], x['c2']), axis = 1)

But you’ll notice it’s much slower than this:

df['maximum'] = df[['c1', 'c2']].max(axis=1)

Bottom line: don't use apply if the job can be done with another built-in function, which is usually much faster. For example, to round column c to integers, use round(df['c'], 0) or df['c'].round(0) instead of df.apply(lambda x: round(x['c'], 0), axis=1).
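The rounding comparison as code (assuming a numeric column c, as in the text):

# Preferred: the vectorized built-in
df['c_rounded'] = df['c'].round(0)
# Much slower equivalent with apply:
# df['c_rounded'] = df.apply(lambda x: round(x['c'], 0), axis=1)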

value_counts

This method is used to check the distribution of values. For example, if you want to know the possible values in column c and how often each one appears, you can do:

df['c'].value_counts()

Here are some other interesting tricks or parameters:

  1. normalize=True: if you want proportions (frequencies) instead of raw counts, use this parameter;
  2. dropna=False: include missing values in the statistics;
  3. df['c'].value_counts().reset_index(): convert the statistics into a dataframe so you can manipulate it;
  4. df['c'].value_counts().reset_index().sort_values(by='index') or df['c'].value_counts().sort_index(): sort the statistics table by the values of the column rather than by count.
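
A short sketch combining the first two parameters, again on a hypothetical column c:

# Proportions instead of counts, with missing values included
df['c'].value_counts(normalize=True, dropna=False)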

number of missing values

When building a model, we may want to remove rows that have too many missing values, or rows where all values are missing. You can count the missing values in each row across specific columns with .isnull() and .sum():

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [1, 2, 3], 'c1': [0, 0, np.nan], 'c2': [np.nan, 1, 1]})
df = df[['id', 'c1', 'c2']]
df['num_nulls'] = df[['c1', 'c2']].isnull().sum(axis=1)
df.head()
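Building on this, a sketch of how you might then drop the rows with too many missing values (the threshold of 2 here is just an example):

# Keep only the rows with fewer than 2 missing values in c1 and c2
df = df[df['num_nulls'] < 2]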

select rows with specific IDs

In SQL, this operation would be SELECT * FROM … WHERE ID in ('A001', 'C022', …) to fetch the records with specific IDs. In pandas, it can be done like this:

df_filter = df['ID'].isin(['A001', 'C022'])  # list all the IDs you need
df[df_filter]

Percentile groups

Suppose you have a column full of numeric values and you want to split them into groups, say the top 5% into group 1, 5-20% into group 2, 20-50% into group 3, and the bottom 50% into group 4. This can be done with pandas.cut, but here is another option:

import numpy as np
cut_points = [np.percentile(df['c'], i) for i in [50, 80, 95]]
df['group'] = 1
for i in range(3):
    df['group'] = df['group'] + (df['c'] < cut_points[i])
# or <= cut_points[i]

This method is very fast.
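For comparison, here is a sketch of the pd.cut alternative mentioned above, reusing cut_points from the snippet (the label order is chosen so that the top 5% lands in group 1):

bins = [-np.inf] + cut_points + [np.inf]
df['group_cut'] = pd.cut(df['c'], bins=bins, labels=[4, 3, 2, 1])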

to_csv

Finally, a very common method is to save it as a CSV file. Here are two tips:

The first is print(df[:5].to_csv()). This command prints exactly what the first five rows will look like when written to the file, without actually saving them.

Another trick is for handling a mix of integers and missing values. When a column contains both missing values and integers, its data type is float rather than int. So when exporting the table, you can add the parameter float_format='%.0f' to round all floats to integers. Use this trick only if you want integer output for every float column: it removes the trailing '.0' from the exported values.
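A minimal sketch of such an export (the file name out.csv is a placeholder):

# Render every float as an integer on export; missing values are left empty
df.to_csv('out.csv', float_format='%.0f')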


Welcome to follow my WeChat official account, "Growth of the Algorithm Ape", or scan the QR code below, so we can communicate, learn, and make progress together!

Recommended reading:

  • Amazing sorting tricks of data scientists
  • An overview of encoding methods for categorical features
  • A quick start with Jupyter Notebook
  • An advanced Jupyter tutorial
  • Implementing multi-threading and multi-processing in just a few lines of code