Pandas is a Python library that provides a large number of functions and methods for working with data quickly and easily. It is one of the reasons Python is a powerful and efficient environment for data analysis. This article presents 23 core Pandas methods, grouped into basic data set reading and writing, data processing, and DataFrame operations.
Pandas is an open source library built on top of NumPy. Its performance-critical parts are written in Cython, so it reads and processes data quickly, and it easily handles missing data in floating-point data (denoted NaN) as well as in non-floating-point data. In this article, the basic data set operations cover reading and writing CSV and Excel files, the basic data processing covers missing values and feature extraction, and the DataFrame operations cover applying functions and sorting.
Basic data set operations
(1) Read the data set in CSV format
pd.read_csv("csv_file")
Older versions of Pandas also offered pd.DataFrame.from_csv("csv_file"). That method was removed in Pandas 1.0, so use pd.read_csv instead.
(2) Read Excel data set
pd.read_excel("excel_file")
(3) Write the DataFrame directly to the CSV file
The following writes the file with a comma as the delimiter and without the index:
df.to_csv("data.csv", sep=",", index=False)
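As a minimal sketch of the CSV write and read calls described above, the snippet below writes a small DataFrame to an in-memory buffer and reads it back. The column names and values are made up for illustration, and io.StringIO stands in for a real file path such as "data.csv".

```python
import io

import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"name": ["a", "b", "c"], "size": [1, 2, 3]})

# Write with a comma delimiter and without the index column.
buffer = io.StringIO()
df.to_csv(buffer, sep=",", index=False)

# Rewind the buffer and read the same text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Because index=False was used when writing, the file holds only the two data columns, and read_csv builds a fresh default index when reading.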
(4) Basic data set characteristic information
df.info()
(5) Basic data set statistics
print(df.describe())
(6) Print the DataFrame as a table
The tabulate function from the third-party tabulate package prints tabular data:
print(tabulate(print_table, headers=headers))
Here "print_table" is a list of lists, with one inner list per row, and "headers" is a list of header strings.
(7) List the names of all columns
df.columns
Basic data processing
(8) Delete the missing data
df.dropna(axis=0, how='any')
Returns a DataFrame from which every row (axis=0) or column (axis=1) containing any NaN value has been removed. With how='all', only rows or columns in which all elements are NaN are removed.
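The difference between how='any' and how='all' is easiest to see on a small example. The sketch below uses made-up data in which one row has a single NaN and another row is entirely NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has one NaN, row 2 is entirely NaN.
df = pd.DataFrame({"height": [1.0, np.nan, np.nan], "size": [5.0, 6.0, np.nan]})

# how="any": drop rows that contain at least one NaN.
any_dropped = df.dropna(axis=0, how="any")

# how="all": drop only rows in which every value is NaN.
all_dropped = df.dropna(axis=0, how="all")

print(len(any_dropped), len(all_dropped))
```

Both calls return new DataFrames and leave df itself unchanged.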
(9) Replace missing data
df.replace(to_replace=None, value=None)
Replaces every occurrence of to_replace in the DataFrame with value. The None defaults shown above are placeholders: both to_replace and value need to be given actual values.
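As a small illustration, the sketch below replaces NaN with 0 in a made-up column. The choice of 0 as the fill value is an assumption for the example, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({"size": [5.0, np.nan, 7.0]})

# Replace every NaN with 0; a new DataFrame is returned.
filled = df.replace(to_replace=np.nan, value=0)

print(filled)
```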
(10) Check for missing values (NaN)
pd.isnull(obj)
Detects missing values: NaN in numeric arrays, and None or NaN in object arrays.
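A short sketch of pd.isnull on scalars and on an object array follows. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Scalars: both NaN and None count as missing.
print(pd.isnull(np.nan), pd.isnull(None), pd.isnull(3.5))

# Object array: the result is a boolean array of the same shape.
values = np.array(["a", None, np.nan], dtype=object)
mask = pd.isnull(values)
print(mask)
```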
(11) Delete features
df.drop('feature_variable_name', axis=1)
axis=0 selects rows and axis=1 selects columns.
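The two axis values can be compared side by side. In this sketch the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["a", "b"], "height": [1.0, 2.0], "size": [5, 3]})

# axis=1: drop the column labeled "height".
without_column = df.drop("height", axis=1)

# axis=0: drop the row whose index label is 0.
without_row = df.drop(0, axis=0)

print(without_column.columns.tolist(), without_row.index.tolist())
```

Like most DataFrame methods, drop returns a new object, so df keeps all of its rows and columns.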
(12) Convert the object type to floating point
pd.to_numeric(df["feature_name"], errors='coerce')
Converts an object-typed column (strings, in this case) to numeric values so that it can be used in further computation. With errors='coerce', values that cannot be parsed become NaN.
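The sketch below shows the effect of errors='coerce' on a made-up string column that contains one entry that is not a number.

```python
import pandas as pd

# Hypothetical column read in as strings, with one unparseable entry.
df = pd.DataFrame({"feature_name": ["1.5", "2", "n/a"]})

# errors="coerce" turns values that cannot be parsed into NaN.
numeric = pd.to_numeric(df["feature_name"], errors="coerce")

print(numeric)
```

The resulting NaN can then be handled with the missing-data methods described earlier, such as dropna.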
(13) Convert the DataFrame to a NumPy array
df.to_numpy()
Older versions of Pandas used df.as_matrix(), which was removed in Pandas 1.0.
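A minimal sketch of the conversion, using to_numpy, the method available in current Pandas releases. The sample data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({"height": [1.0, 2.0], "size": [5.0, 6.0]})

# Each DataFrame row becomes one row of the resulting 2-D array.
array = df.to_numpy()

print(isinstance(array, np.ndarray), array.shape)
```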
(14) Get the first n rows of the DataFrame
df.head(n)
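(15) Fetch data by feature name
df.loc[feature_name]
With a single label, loc selects the row whose index label is feature_name. To select a column by name, use df[feature_name] or df.loc[:, feature_name].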
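To illustrate how label-based selection behaves, the sketch below builds a small DataFrame with string index labels and selects one row and one column. The labels "first" and "second" are hypothetical.

```python
import pandas as pd

# Hypothetical data with explicit row labels.
df = pd.DataFrame(
    {"height": [1.0, 2.0], "size": [5, 6]},
    index=["first", "second"],
)

# A single label passed to .loc selects a row by its index label.
row = df.loc["first"]

# A column (feature) is selected with df[...] or with .loc[:, ...].
column = df["size"]
same_column = df.loc[:, "size"]

print(row)
print(column)
```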
DataFrame operations
(16) Apply a function to a DataFrame
The following multiplies all values in the DataFrame column "height" by 2:
df["height"].apply(lambda height: 2 * height)
Or:
def multiply(x):
    return x * 2
df["height"].apply(multiply)
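A runnable version of both forms is sketched below on made-up data. It also shows the vectorized expression df["height"] * 2, which gives the same result for simple arithmetic.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"height": [1.5, 2.0, 3.0]})


def multiply(x):
    """Return the input doubled."""
    return x * 2


# apply calls the function once per element of the column.
doubled_lambda = df["height"].apply(lambda height: 2 * height)
doubled_named = df["height"].apply(multiply)

# For simple arithmetic, the vectorized form gives the same result.
doubled_vectorized = df["height"] * 2

# apply returns a new Series; assign it back to keep the result.
df["height"] = doubled_named

print(df)
```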
(17) Rename a column
The following code renames the third column of the DataFrame to "size":
df.rename(columns={df.columns[2]: 'size'}, inplace=True)
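As a small sketch, the same rename can be written without inplace=True, in which case a new DataFrame is returned and the original keeps its column names. The starting column name "sz" is hypothetical.

```python
import pandas as pd

# Hypothetical data whose third column has an unhelpful name.
df = pd.DataFrame({"name": ["a"], "height": [1.0], "sz": [5]})

# Rename the third column (position 2) to "size", returning a new DataFrame.
renamed = df.rename(columns={df.columns[2]: "size"})

print(renamed.columns.tolist())
```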
(18) Get the unique values of a column
The following code returns the unique values in the "name" column:
df["name"].unique()
(19) Select a sub-DataFrame
The following code extracts the columns "name" and "size" from the DataFrame:
new_df = df[["name", "size"]]
(20) Summarize data information
# Sum of values in a data frame
df.sum()
# Lowest value of a data frame
df.min()
# Highest value
df.max()
# Index of the lowest value
df.idxmin()
# Index of the highest value
df.idxmax()
# Statistical summary of the data frame, with quartiles, median, etc.
df.describe()
# Average values
df.mean()
# Median values
df.median()
# Correlation between columns
df.corr()
# To get these values for only one column, just select it like this
df["size"].median()
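The summary methods above can be tried on a small example. The sketch below uses made-up numeric columns only, since in recent Pandas versions methods such as mean, median, and corr may raise an error on non-numeric columns unless numeric_only=True is passed.

```python
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({"height": [1.0, 2.0, 6.0], "size": [5, 3, 4]})

totals = df.sum()          # per-column sums
lowest_at = df.idxmin()    # index label of each column's minimum
highest_at = df.idxmax()   # index label of each column's maximum
averages = df.mean()       # per-column means

# The same methods work on a single column and return a scalar.
size_median = df["size"].median()

print(totals, lowest_at, highest_at, averages, size_median, sep="\n")
```

Each DataFrame-level call returns a Series indexed by column name, so an individual result can be read with, for example, totals["height"].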
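(21) Sort the data
df.sort_values(by="size", ascending=False)
For a DataFrame, sort_values requires the by argument, which names the column or columns to sort on. The "size" column is used here as an example.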
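A minimal sketch of sorting in descending order follows, on made-up data. Note that sorting a DataFrame needs a by argument naming the sort column, while sorting a single column (a Series) does not.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["a", "b", "c"], "size": [3, 5, 4]})

# Sort the whole DataFrame by the "size" column, largest first.
sorted_df = df.sort_values(by="size", ascending=False)

# A Series has only one set of values, so no by argument is needed.
sorted_sizes = df["size"].sort_values(ascending=False)

print(sorted_df)
```

The original index labels travel with their rows; call reset_index(drop=True) on the result if a fresh 0, 1, 2 index is wanted.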
(22) Boolean index
The following code filters on the "size" column and keeps only the rows whose value is 5:
df[df["size"] == 5]
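The filter works in two steps, which the sketch below separates: the comparison builds a boolean Series, and indexing with that Series keeps the rows where it is True. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["a", "b", "c"], "size": [5, 3, 5]})

# Step 1: one True/False value per row.
mask = df["size"] == 5

# Step 2: keep only the rows where the mask is True.
filtered = df[mask]

print(filtered)
```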
(23) Select a specific value
The following code selects the value in the first row of the "size" column:
df.loc[0, 'size']
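As a sketch of the selection above, the snippet below reads the same cell in three ways. It assumes the default integer index, so that the first row has the label 0, and the sample data is made up.

```python
import pandas as pd

# Hypothetical data with the default integer index (0, 1, ...).
df = pd.DataFrame({"name": ["a", "b"], "size": [5, 3]})

# Label-based: row label 0, column label "size". Returns a scalar.
by_label = df.loc[0, "size"]

# Position-based: first row, whatever its label is.
by_position = df.iloc[0, df.columns.get_loc("size")]

# Passing lists to .loc returns a 1x1 DataFrame instead of a scalar.
as_frame = df.loc[[0], ["size"]]

print(by_label, by_position, as_frame.shape)
```

The position-based form is the safer choice when the index has been changed, for example after sorting or filtering.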