Pandas is a Python library that provides a large set of functions and methods for working with data quickly and easily. It is one of the main reasons Python is a powerful and efficient environment for data analysis. This article presents 23 core Pandas methods, grouped into basic data set reading and writing, data processing, and DataFrame operations.







Pandas is an open source library built on top of NumPy. Its performance-critical parts are written in Cython, so it reads and processes data quickly, and it handles missing data in both floating-point data (denoted NaN) and non-floating-point data. In this article, the basic data set operations cover reading and writing CSV and Excel files, the basic data processing section covers missing values and feature extraction, and the DataFrame operations section covers applying functions and sorting. The snippets assume the usual import pandas as pd and an existing DataFrame named df.

Basic data set operations

(1) Read the data set in CSV format

pd.read_csv("csv_file")

Older versions of Pandas also offered the following, which was deprecated and then removed in Pandas 1.0:

pd.DataFrame.from_csv("csv_file")

(2) Read Excel data set

pd.read_excel("excel_file")

(3) Write the DataFrame directly to the CSV file

The following writes the DataFrame using a comma as the delimiter and omits the index:

df.to_csv("data.csv", sep=",", index=False)
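As a rough illustration of the write and read calls above, the sketch below round-trips a small DataFrame through CSV text. The sample columns are made up for the example, and an in-memory buffer stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({"name": ["a", "b"], "size": [1, 2]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, sep=",", index=False)

# Rewind the buffer and read the same CSV text back.
buffer.seek(0)
df_back = pd.read_csv(buffer)
print(df_back)
```

Because index=False was used when writing, the file contains only the two data columns, and read_csv assigns a fresh default index.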

(4) Basic data set characteristic information

df.info()

(5) Basic data set statistics

print(df.describe())

(6) Print a DataFrame as a table

The third-party tabulate package (from tabulate import tabulate) prints tabular data as a formatted table:

print(tabulate(print_table, headers=headers))

Here print_table is a list of lists, one inner list per row, and headers is a list of header strings.

(7) List the names of all columns

df.columns
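The inspection calls in (4), (5) and (7) can be tried together on a small DataFrame. This is only a sketch; the sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"name": ["a", "b", "c"], "size": [1, 2, 3]})

# Prints the column dtypes and non-null counts to standard output.
df.info()

# Summary statistics; by default only numeric columns are included here.
stats = df.describe()
print(stats)

# Column labels as a plain Python list.
column_names = list(df.columns)
print(column_names)
```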

Basic data processing

(8) Delete the missing data

df.dropna(axis=0, how="any")

Returns a DataFrame from which every row (axis=0) or column (axis=1) containing any NaN value has been removed. With how="all", a row or column is removed only if all of its elements are NaN.
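The difference between how="any" and how="all" is easiest to see on a tiny example. The sketch below uses made-up data with one fully populated row, one partly missing row, and one fully missing row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 is partly missing, row 2 is all NaN.
df = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [4.0, 5.0, np.nan]})

# Drop rows that contain at least one NaN: only row 0 survives.
any_dropped = df.dropna(axis=0, how="any")

# Drop rows in which every value is NaN: rows 0 and 1 survive.
all_dropped = df.dropna(axis=0, how="all")

print(any_dropped)
print(all_dropped)
```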

(9) Replace missing data

df.replace(to_replace=None, value=None)

Replaces every occurrence of to_replace in the DataFrame with value. The None defaults shown are placeholders: pass the value to look for as to_replace and the value to substitute as value.
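A minimal sketch of replacing missing entries, assuming NaN is the value to look for and 0.0 is the substitute (both choices are illustrative). For this particular case, fillna gives the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry.
df = pd.DataFrame({"size": [1.0, np.nan, 3.0]})

# Replace every NaN with 0.0; the original DataFrame is left unchanged.
filled = df.replace(to_replace=np.nan, value=0.0)

# fillna is the more specialised call for the same job.
filled_alt = df.fillna(0.0)

print(filled)
```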

(10) Check for missing values (NaN)

pd.isnull(obj)

Detects missing values: NaN in numeric arrays and None or NaN in object arrays.
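The following sketch shows what the check returns for a numeric array, an object array, and a scalar. The inputs are invented for the example.

```python
import numpy as np
import pandas as pd

# A numeric array with one NaN, and an object array holding None and NaN.
numeric = np.array([1.0, np.nan, 3.0])
mixed = np.array(["x", None, np.nan], dtype=object)

# For array input the result is a boolean array of the same shape.
numeric_mask = pd.isnull(numeric)
mixed_mask = pd.isnull(mixed)

# For a scalar the result is a single boolean.
scalar_check = pd.isnull(None)

print(numeric_mask, mixed_mask, scalar_check)
```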

(11) Delete features

df.drop("feature_variable_name", axis=1)

axis=0 drops rows and axis=1 drops columns.
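To make the two axis values concrete, here is a small sketch on hypothetical data. Note that drop returns a new DataFrame and leaves the original untouched unless inplace=True is passed.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"name": ["a", "b"], "size": [1, 2]})

# axis=1: drop the column labelled "size".
without_column = df.drop("size", axis=1)

# axis=0: drop the row whose index label is 0.
without_row = df.drop(0, axis=0)

print(without_column)
print(without_row)
```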

(12) Convert a column to a numeric type

pd.to_numeric(df["feature_name"], errors="coerce")

Converts the column, in this case one that holds strings, to a numeric type so that further computation can be performed on it. With errors="coerce", values that cannot be parsed become NaN.
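A short sketch of the conversion on a string column, assuming one entry that cannot be parsed. The column contents are made up.

```python
import pandas as pd

# Hypothetical string column; the last entry is not a number.
df = pd.DataFrame({"feature_name": ["1.5", "2", "oops"]})

# Parse the strings as numbers; unparseable entries become missing values.
converted = pd.to_numeric(df["feature_name"], errors="coerce")

print(converted)
```

Without errors="coerce", the default behaviour is to raise an error on the unparseable entry.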

(13) Convert DataFrame to NumPy array

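df.to_numpy()

The older df.as_matrix() was removed in Pandas 1.0; df.to_numpy() is available from Pandas 0.24 onward.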

(14) Get the first n rows of a DataFrame

df.head(n)

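(15) Fetch data by feature name

df.loc[:, feature_name]

Note that df.loc[label] with a single label selects a row by its index label; the form above, or df[feature_name], selects a column.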
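Items (13) to (15) can be sketched together as follows. The data and the row labels r1 to r3 are hypothetical, chosen so that row labels and column names are easy to tell apart.

```python
import pandas as pd

# Hypothetical data with explicit row labels.
df = pd.DataFrame(
    {"name": ["a", "b", "c"], "size": [1, 2, 3]},
    index=["r1", "r2", "r3"],
)

# First two rows.
first_two = df.head(2)

# .loc with a single label selects a row by its index label.
row = df.loc["r2"]

# .loc with a slice for rows and a name for columns selects a column.
column = df.loc[:, "size"]

# Convert the numeric part of the DataFrame to a NumPy array.
array = df[["size"]].to_numpy()

print(first_two, row, column, array, sep="\n")
```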

DataFrame operations

(16) Apply a function to a DataFrame

The following multiplies every value in the DataFrame column "height" by 2:

df["height"].apply(lambda height: 2 * height)

Or:

def multiply(x):
    return x * 2

df["height"].apply(multiply)
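Both forms can be run side by side, as in the sketch below on invented heights. For simple arithmetic like this, a plain vectorized expression gives the same result without apply, which is shown for comparison.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"height": [1.5, 2.0]})


def multiply(x):
    """Double a single value."""
    return x * 2


# apply with an anonymous function.
doubled_lambda = df["height"].apply(lambda height: 2 * height)

# apply with a named function.
doubled_named = df["height"].apply(multiply)

# Vectorized arithmetic on the whole column.
doubled_vectorized = df["height"] * 2

print(doubled_lambda)
```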

(17) Rename a column

The following code renames the third column of the DataFrame to "size":

df.rename(columns={df.columns[2]: "size"}, inplace=True)
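A minimal sketch of renaming by position, assuming a DataFrame whose third column has the placeholder name "sz". This variant returns a new DataFrame rather than using inplace=True, so the original stays available for comparison.

```python
import pandas as pd

# Hypothetical data; "sz" is a placeholder name for the third column.
df = pd.DataFrame({"name": ["a"], "age": [3], "sz": [5]})

# Look up the current label of the third column and map it to "size".
renamed = df.rename(columns={df.columns[2]: "size"})

print(list(renamed.columns))
```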

(18) Get the unique entries of a column

The following code returns the unique entries in the "name" column:

df["name"].unique()

(19) Access a sub-DataFrame

The following code extracts the selected columns "name" and "size" from the DataFrame:

new_df = df[["name", "size"]]
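Items (18) and (19) can be tried on a small DataFrame with a repeated name. The data, including the extra "height" column, is hypothetical.

```python
import pandas as pd

# Hypothetical data in which the name "a" appears twice.
df = pd.DataFrame(
    {"name": ["a", "b", "a"], "size": [1, 2, 1], "height": [1.5, 1.8, 1.5]}
)

# Distinct values of one column, in order of first appearance.
unique_names = df["name"].unique()

# A list of column names inside the brackets selects several columns at once.
new_df = df[["name", "size"]]

print(unique_names)
print(new_df)
```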

(20) Summarize data information

# Sum of values in a data frame

df.sum()

# Lowest value of a data frame

df.min()

# Highest value

df.max()

# Index of the lowest value

df.idxmin()

# Index of the highest value

df.idxmax()

# Statistical summary of the data frame, with quartiles, median, etc.

df.describe()

# Average values

df.mean()

# Median values

df.median()

# Correlation between columns

df.corr()

# To get these values for only one column, select it like this

df["size"].median()
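The summary methods above can be checked by hand on a tiny all-numeric DataFrame, as in this sketch with invented values. If a DataFrame also holds string columns, recent Pandas versions may require numeric_only=True for methods such as mean and corr.

```python
import pandas as pd

# Hypothetical all-numeric data so every summary method applies.
df = pd.DataFrame({"size": [1, 5, 3], "height": [2.0, 4.0, 9.0]})

# Per-column results are returned as a Series indexed by column name.
total = df.sum()
lowest_label = df.idxmin()
highest_label = df.idxmax()

# Pairwise correlation between columns, as a DataFrame.
correlation = df.corr()

# The same methods work on a single column and return a scalar.
median_size = df["size"].median()

print(total, highest_label, median_size, sep="\n")
```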

(21) Sort the data

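df.sort_values(by="size", ascending=False)

For a DataFrame, sort_values needs the by argument naming the column or columns to sort on; "size" is used here as an example.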
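A short sketch of a descending sort on hypothetical data. When called on a DataFrame, sort_values takes a by argument naming the sort column; here that column is assumed to be "size".

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"name": ["a", "b", "c"], "size": [2, 5, 1]})

# Sort rows by the "size" column, largest first; returns a new DataFrame.
sorted_df = df.sort_values(by="size", ascending=False)

print(sorted_df)
```

The rows keep their original index labels after sorting; reset_index(drop=True) can be chained on if a fresh 0..n-1 index is wanted.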

(22) Boolean indexing

The following code filters on the "size" column and keeps only the rows whose value equals 5:

df[df["size"] == 5]
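The filter works in two steps, which the sketch below separates: the comparison builds a boolean Series, and indexing with that Series keeps the rows marked True. The sample data is made up.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"name": ["a", "b", "c"], "size": [5, 2, 5]})

# Step 1: element-wise comparison gives one boolean per row.
mask = df["size"] == 5

# Step 2: indexing with the mask keeps only the rows where it is True.
filtered = df[mask]

print(mask.tolist())
print(filtered)
```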

(23) Select a specific value

The following code selects the value in the "size" column of the first row, assuming the default integer index so that the first row has label 0:

df.loc[0, "size"]
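As a sketch of label-based selection, note that .loc is indexed with square brackets rather than called like a function. The example assumes hypothetical data with the default integer index, and contrasts a scalar lookup with the list form, which returns a one-cell DataFrame.

```python
import pandas as pd

# Hypothetical sample data with the default 0..n-1 index.
df = pd.DataFrame({"name": ["a", "b"], "size": [7, 9]})

# Row label 0, column "size": returns the single stored value.
first_size = df.loc[0, "size"]

# Passing lists instead returns a 1x1 DataFrame.
first_size_frame = df.loc[[0], ["size"]]

print(first_size)
print(first_size_frame)
```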